Monday, May 17, 2010

The real problem is not always obvious.

Recently at work I helped out with a project to move the IP address of a slave name server. After the move went smoothly, the master name server crashed, or at least appeared to be down. I wrote up a report on the outage and the steps I took to find the real, hidden problem using simple troubleshooting. Please note that I changed the IP addresses and domain names for privacy.

Brief history of the environment

The San Francisco site has two name servers, dns1 (10.10.10.1) and dns2 (10.10.10.2), both running a flavor of RHEL with BIND version 9. The servers are authoritative for the child.parent.com domain and are also slave servers for the parent.com domain, which is managed in Tokyo. In the days leading up to the outage, we added a Nagios client to both servers and changed the IP address of dns2 from 10.10.10.2 to 20.20.20.2 on a Wednesday.
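
For context, the relevant zone definitions on dns1 would look something like the sketch below; the file names and the Tokyo master's address are placeholders, since I'm only reconstructing the shape of the configuration.

// Sketch of the relevant named.conf zones on dns1 (file paths and the
// Tokyo master's IP address are placeholders, not the real values)
zone "child.parent.com" {
    type master;
    file "child.parent.com.zone";
};

zone "parent.com" {
    type slave;
    masters { 30.30.30.1; };    // Tokyo master, placeholder address
    file "slaves/parent.com.zone";
};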

First problem appears

On Monday around 6:00am, users alerted us that our VPN site was down. At first this appeared to be a network problem, but users reported that the external sites were down as well. The network engineer verified that the network was up and that all sites were accessible by IP address. The sites simply were not resolving by name, even from external test sites. In fact, no external names in the child.parent.com domain could be resolved.

When I tried to check the dns1 server, I could not log in using SSH, so I walked to the data center to use the console. There I found the screen filled with the following message.



At first I could not get to a command prompt, so I restarted the server using the power switch, but I forgot to record the exact information displayed on the screen (the screenshot above was actually taken later). Once the server was restarted, all of the external sites resolved by name again and everything was back to normal. All five external DNS servers were resolving all of the external sites again, which we verified using the tools at DNSstuff.
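
The same check can also be run from a command line with dig; for example (the server and host names here are placeholders):

# Query an external slave and the master directly for a name in the zone
# (server and host names are placeholders)
dig @ns1.parent.com www.child.parent.com A +short
dig @dns1.child.parent.com www.child.parent.com A +short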

Problem appeared to be solved?

At first we believed the problem was resolved and had just been a random server crash, but it seemed like too much of a coincidence to crash two days after the IP address change. My gut feeling was that an IP configuration change on another server would not cause this problem, but I wasn't confident that the server was 100% stable.

The first question was why all five external DNS servers failed to resolve any name under child.parent.com. The answer came from looking at dns1's BIND configuration, shown below. The server dns1 is the master for child.parent.com, and the zone's "expire" value is set to two days, 172,800 seconds. This means that if the slaves cannot reach dns1 for two days or more, they will discard their copies of the child.parent.com zone.
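
The expire value itself lives in the zone's SOA record; reconstructed, the relevant part looked roughly like this (every number except the 172800 expire is illustrative):

child.parent.com.  IN  SOA  dns1.child.parent.com. hostmaster.child.parent.com. (
        2010051401  ; serial
        3600        ; refresh
        900         ; retry
        172800      ; expire - slaves discard the zone after 2 days without contact
        3600 )      ; negative caching TTL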



So now we knew BIND was configured so the slaves would discard the zone after two days without contact from the master, which meant dns1 must have failed at least two days before Monday, most likely Friday night.

While reviewing dns1's syslog I found numerous SSH login attempts, but nothing that specifically reported an error with the system. During the team meeting, management decided the likely cause was the last change made, the IP address change of dns2 from 10.10.10.2 to 20.20.20.2, and that it should be changed back, but I didn't think that change could cause another server to crash.

As the day closed out, I started researching the logs and pushed the IP rollback to the next day, once I had everything from the day's problem recorded.

The problem appears again

That night around 12:30am I was working on the report from home, with a VPN connection to the office and an SSH session to dns1, reading the logs and writing the outage documentation. While I was working on dns1, my connection suddenly dropped. I knew there was an idle timeout for SSH connections, but when I tried to reconnect I could not, and the server was not replying to ping either. At this point I knew I had caught the problem in the act, but I also had to drive into work. lol. If I didn't go in, I wouldn't see exactly what was on the console.

At work, I found the following error message on the console. It was the same one as before, which I had forgotten to write down, so this time I took a photo with my phone. I was also able to Ctrl-C out of the messages and get to the console.



OK, now we had a major clue to start with. I searched Google for the phrase "ip_conntrack table full dropping packet", which came back with many results. It turns out ip_conntrack is the connection tracking module used by iptables, the built-in firewall of a Linux system. The error means the kernel's connection tracking table is full, so the server starts dropping packets for new connections.

From the recommendations I found that the conntrack settings in sysctl.conf can be tuned to suit the server's physical memory size. The dns1 server has 1GB of RAM and was using the default limit of 65,528 tracked connections. The error message indicated all 65,528 entries were in use, so I increased the limit to 98,000 and also shortened the established-connection timeout from 432,000 seconds to 28,800 so stale entries would be cleared faster. After making the changes I restarted the networking service and checked the settings using the following command.

/sbin/sysctl -a | grep conntrack
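
For reference, the lines added to /etc/sysctl.conf looked roughly like the sketch below; the exact key names depend on the kernel version, so treat them as an assumption.

# Conntrack tuning for a 1GB server (key names vary by kernel version)
net.ipv4.netfilter.ip_conntrack_max = 98000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 28800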

Right after checking the settings, I confirmed the new values had taken effect, so next I wanted to see how many of the connection tracking entries were already in use. I used the following command to count them.

wc -l /proc/net/ip_conntrack

I was surprised to find that all 98,000 entries were used up again just seconds after I had raised the limit! At that point I knew something had to be generating extremely heavy traffic on the server. My gut said denial of service attack, but I had no proof, so I started looking at what the server was doing locally. First I ran the simple command that shows the top processes and a general overview of the server.

top



Ahh, now we could see multiple processes called "ssh-scan" running on the server. I wanted to know how many were running, so I ran the following command.

ps aux | grep ssh-scan

The result was a listing of over 200 ssh-scan processes running on the server. It was highly unlikely that any member of the IT staff would be scanning servers, so I was almost certain this was malicious. It's important to note that ssh-scan is not like Nmap, which has legitimate uses; ssh-scan is just a brute-force scanner.
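
A quick way to get the count in one shot (the bracketed first letter keeps grep from counting itself):

ps aux | grep -c '[s]sh-scan'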

Now that I knew the purpose of the tool, I wanted to see dns1's network activity. The following command showed me the results.

netstat -n



It looked like the server was trying to connect to random hosts on the Internet via SSH. Next I wanted to see only the activity on port 22 (SSH), so I used the following command.

/usr/sbin/lsof -w -n -i tcp:22

ssh-scan 4843 root 7u IPv4 424670 TCP 10.10.10.1:33835->115.238.100.174:ssh (ESTABLISHED)
ssh-scan 4846 root 7u IPv4 424767 TCP 10.10.10.1:43949->115.238.100.190:ssh (SYN_SENT)
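
To see how many connections were sitting in each state, a one-liner like this also helps paint the picture (a sketch):

# Count TCP connections by state (ESTABLISHED, SYN_SENT, TIME_WAIT, ...)
netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn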

At this point I was almost 100% certain that there was a rootkit or some malicious process installed on dns1. I wanted to figure out where it was located, but my first attempts to find it were not successful. I went back to Google and stumbled across a forum posting on how to run an SSH scanner. The poster showed the basic steps to deploy and set up a remote SSH scanner on a compromised server, basically what I had in front of me. Ironically, his detailed steps gave me the biggest clues on how to find the application.

Here's the link to the forum posting.
http://www.governmentsecurity.org/forum/index.php?showtopic=11026

And here is the important clue from the posting:

then type the following
cd /usr/man/man3/
and then :
mkdir ". hiden"
and then :
cd "..."
This is an hidden dir so the Sysop wont notice

Using the clue from the posting, I guessed the SSH scanning application would be installed somewhere under /usr. I poked around but didn't find anything until I came to /usr/tmp. It was just a hunch: where would you hide an important file? In a temp directory. I also thought of the movie Hackers, where the file was hidden in a garbage directory to stay out of sight.
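
A quicker way to hunt for hidden directories across the whole /usr tree would have been a find command like this (a sketch):

# List any directory under /usr whose name starts with a dot
find /usr -type d -name '.*' 2>/dev/null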

I then ran the command to view hidden directories.

ls -a



Ahh, now we found something: a directory called "ssh-scan"! So what was this application really doing?



Bios.txt - Very long listing of IP addresses
Nfu.txt - Very long listing of random IP addresses
Pass_file - Dictionary attack file
Spd - Script
Vuln.txt - Appears to hold account names, passwords and IP addresses of cracked systems.

I deleted the files and the directory, but now I wanted to find out how this security breach really happened.

Resolving the security issue

I knew the root account does not have SSH access, so I had to find how else the attacker got into dns1. In /var/log/messages I found many "failed password" entries from various IP addresses, against every possible account name from "root" to random names like "paulsmith". I exported the logs and searched for successful logins, and found them for two accounts: "siteadmin" and "service". (Note: "service" is not the actual account name.)
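
The searches themselves were nothing fancy; roughly like this (on this box the SSH messages went to /var/log/messages, though many RHEL systems log them to /var/log/secure instead):

grep 'Failed password' /var/log/messages | less
grep 'Accepted password' /var/log/messages | less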

Looking at the two accounts: "siteadmin" is the account we use to log in locally and then su to root. All of its logins came from internal IP addresses, so we knew this account was not part of the breach. The other account, "service", had logins from various IP addresses that we did not own or recognize. The "service" account was created for the local service agent that monitors the hardware on the server. The questions were how anyone got access to the service account, and why it could log in remotely at all, since it should not have had SSH access.

I removed access for the problem account by changing its shell in the passwd file to "nologin", taking away its login rights. I changed the root and siteadmin passwords, and also disabled SSH entirely since no one really needed it; the server is just a few feet inside the data center. Digging deeper, we found that the service account had been added to the "wheel" group, which on this server granted remote SSH login access, and that its password did not follow our normal password standards.
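
On RHEL the cleanup boiled down to commands along these lines (a sketch, with "service" standing in for the real account name as before):

usermod -s /sbin/nologin service   # remove the login shell from the service account
gpasswd -d service wheel           # pull the account out of the wheel group
passwd root                        # set new passwords for root and siteadmin
passwd siteadmin
service sshd stop                  # stop sshd and keep it from starting at boot
chkconfig sshd off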

To compound the issue, the external firewall was allowing SSH from the Internet to the servers in the DMZ, which the rest of the team did not know about.

Why and how could this happen?

It's very easy to assume that everything was set up properly and that there shouldn't be any problems. The trouble is that unless you installed the server or deployed the firewall rules yourself, you never really know for sure. I've run into this many times at different job sites, and it's always eye-opening.

Part of the problem was that different departments were working on the server at the same time. When one group accesses the system, they may not know details that another team knows, for example that the server allowed SSH access from the Internet. Another issue was how little anyone knew about the overall security details.

Our department created the service account with a generic password at the request of another team, but did not add it to any SSH access group. Another department was probably troubleshooting the server for a different issue and figured they would add the service account to the SSH access group to make their work easier. And since SSH was open to the outside world, malicious users could scan for it and try to get in. The IPS devices were not monitoring for this type of attack, so it was never picked up. With enough attempts, the attacker eventually found a password that worked, gained access to the server, deployed the SSH scanning tool, and started maxing out the server's resources.

It's a bit ironic that the name server being so old is what alerted us: even a little bit of scanning overloaded the server to the point where it dropped off the network.

What to do in the future?

Monitor your logs! I had seen the SSH attempts earlier but never imagined someone would add an account with a very easy-to-crack password to the SSH access group. Now that I know the dangers of exposed SSH, I keep the service disabled and only turn it on when needed. Also audit the passwords on your systems and verify who has what access; don't assume everything is as it should be.

It was a tough problem to solve, especially since I don't have much Linux administration experience, but it was very interesting, and with my Windows administration background it was mostly a matter of knowing the correct commands. I knew something was running, but what? I looked for the process using the most resources, and once I found it, I needed to know what it was doing. After that came the question of how they gained access; I searched the logs and found that the IP addresses for one account did not match the other, which was the clue to where the logins were coming from.

I intend to really learn more about Linux administration so that if this happens again I can resolve the problem much quicker. So far, it has been quite an introduction to security problems on Linux.