Linux Performance Troubleshooting
When a Linux server is having performance issues, it’s important to determine the root cause. Linux performance troubleshooting involves determining whether the load is:
- CPU-bound (processes waiting for CPU resources)
- RAM-bound (high RAM usage forcing pages out to swap)
- I/O-bound (processes fighting for disk or network I/O)
Troubleshooting Server Bottlenecks
- Use top/iotop to identify what resources you are running out of (CPU, RAM or Disk I/O)
- Identify what processes are consuming the most resources (CPU, RAM or Disk I/O).
CPU-Bound Load
CPU load is the number of processes in a runnable or uninterruptible state. Linux system tools report the “load average”, which averages this load over several periods of time (1, 5 and 15 minutes).
Example: If a server has one CPU core and that core is at 100% utilization, the system is at a load of 1.
From a LAMP perspective, web servers are usually CPU bound. If you receive a burst of traffic, you will see a load spike as the Apache processes compete for system resources. As the burst subsides and the Apache processes complete their requests, the load comes back down.
How do you find the CPU load of a Linux server?
1) uptime (1, 5 and 15 minute load averages)

[root@test ~]$ uptime
 19:58:05 up 270 days, 2:38, 2 users, load average: 2.55, 2.37, 1.87
Tip: Systems tend to be more responsive under CPU-bound load than under I/O-bound load.
2) top
top reads the load average from /proc/loadavg
[root@test ~]$ cat /proc/loadavg
1.43 2.04 1.80 1/664 10619
When you run top, the load average is displayed in the top-right corner of the header.
However, the load average doesn’t tell us much unless we consider the total number of cores. Generally you want the load number to be less than the number of CPU cores you have. When the load average exceeds the core count, the CPUs are saturated and the workload is queuing.
Finding the number of CPU cores
1) Run top and press 1. This gives you a breakdown of the total number of cores and their usage.
2) You can also use the nproc command (http://www.cyberciti.biz/faq/linux-get-number-of-cpus-core-command/):
[root@test ~]$ nproc
16
In the example above, we have 16 cores. If the load is over 16, we know there is queuing and the CPUs are maxed.
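As a quick check, you can compare the 1-minute load average against the core count directly. This is only a minimal sketch, not part of any standard tool:

# Hypothetical one-liner: warn when the 1-minute load exceeds the core count
awk -v cores="$(nproc)" '$1 > cores { print "CPU saturated: load", $1, "exceeds", cores, "cores" }' /proc/loadavg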
Which process is causing the high CPU load?
If the load is CPU-bound, use top to display which processes are consuming the most CPU. top sorts processes by CPU usage by default. Hit the F key to switch to a screen where you can choose the sort column.
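If you prefer a non-interactive snapshot (for example inside a script), ps can produce a similar CPU-sorted list. This is just an alternative view, not a replacement for top:

# Top 10 processes by CPU usage, non-interactively
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 11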
If a web server is experiencing heavy load and the Apache processes are consuming the most CPU, you can use profiling tools like New Relic to determine what the Apache processes are doing and which requests are the most expensive.
RAM-Bound Load
When free memory on a server drops too low, there is no more room in RAM and the system begins to swap. Swapping is when a page in memory is written to disk to free up RAM. Compared to RAM, hard disks are very slow; because running from memory is orders of magnitude faster than running from disk, a swapping system usually slows down considerably. The more swapping, the slower the system becomes.
Because swapping leads to high I/O, when you see high I/O it’s important to determine whether the system is swapping or whether reads/writes to disk are genuinely high (an I/O-bound workload).
Using top, you can see how much memory the system has and how much is free. Use the Mem row to see the total, used and free memory on a system.
[root@test ~]$ top
top - 19:51:48 up 270 days, 2:31, 2 users, load average: 2.11, 1.77, 1.50
Tasks: 440 total, 1 running, 439 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.6%us, 1.1%sy, 0.0%ni, 90.1%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 24591916k total, 19192916k used, 5399000k free, 441868k buffers
Swap: 2097144k total, 506732k used, 1590412k free, 6147300k cached
You can also use the free -m command to see memory information (including buffers/cache). To find out how much RAM is really being used by processes, subtract the buffers and file cache from the used RAM.
[root@test ~]# free -m
             total       used       free     shared    buffers     cached
Mem:           498         93        405          0         15         32
Low:           498         93        405
High:            0          0          0
-/+ buffers/cache:          44        453
Swap:         1023          0       1023
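A small sketch of that subtraction, assuming the older free output layout shown above (where buffers and cached are columns 6 and 7 of the Mem line):

# Approximate RAM actually used by processes: used minus buffers minus cached (MB)
free -m | awk '/^Mem:/ { print "Used by processes (MB):", $3 - $6 - $7 }'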
Is the system currently swapping?
You can use vmstat to determine if a server is currently swapping. Look at the two columns under “swap”:
- “si” shows you swap in
- “so” shows you swap out
Swap in/out numbers should be zero. If they are greater than zero, the system is swapping.
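For example, sampling once a second for five seconds and pulling out the swap columns (si and so are columns 7 and 8 in the classic vmstat layout):

# Print si/so from 5 one-second samples; non-zero values mean the system is swapping
vmstat 1 5 | awk 'NR > 2 { print "swap in:", $7, " swap out:", $8 }'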
Out-of-memory (OOM)
When memory is dangerously low, the Linux kernel invokes the out-of-memory (OOM) killer, which starts killing processes to free memory.
Any actions taken by the OOM killer are logged to the /var/log/messages* log files:
grep -i kill /var/log/messages*
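The same events can usually also be found in the kernel ring buffer (the exact message text varies by kernel version):

# Look for recent OOM killer activity via dmesg
dmesg | grep -iE 'out of memory|killed process'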
Which process is causing the high Memory load?
If the load is RAM-bound, use top to display which processes are consuming the most memory. Hit the M key to sort by RAM usage, or hit the F key to switch to a screen where you can choose the sort column.
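As with CPU, ps gives a scriptable equivalent of a memory-sorted top:

# Top 10 processes by resident memory (RSS), non-interactively
ps -eo pid,user,%mem,rss,comm --sort=-rss | head -n 11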
I/O-Bound Load
When a system is I/O bound, the system is spending a large amount of CPU time waiting for I/O (either network or disk). If the output from top for I/O wait is low, you can rule out I/O as the reason for performance issues.
From a LAMP perspective, database servers are usually I/O bound. To keep I/O on a database server to a minimum, try to keep your “working set” of data in memory. If this becomes difficult on one server, you may have to consider sharding to spread your data across multiple servers, which scales out both reads and writes.
top (I/O wait)
Use top to determine the current percentage of CPU time spent in iowait:
[root@test ~]$ top
top - 20:38:26 up 14 min, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 103 total, 1 running, 102 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.6%us, 0.3%sy, 0.0%ni, 98.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1922320k total, 564172k used, 1358148k free, 10916k buffers
Swap: 1254392k total, 0k used, 1254392k free, 236116k cached
Iostat
iostat - a Linux tool used to determine which hard drive partition has the highest I/O activity.
iostat -x 1 - runs iostat every 1 second with extended results. The number under %iowait tells us how much iowait the system is experiencing.
tps - Transfers per second (a transfer is an I/O request sent to the device).
Blk_read/s - Blocks read from the device per second.
Blk_wrtn/s - Blocks written to the device per second.
Blk_read - Total number of blocks read from the device.
Blk_wrtn - Total number of blocks written to the device.
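You can also limit iostat to a single device. For example (assuming a disk named sda), three extended samples at five-second intervals:

# Extended device statistics for sda only, 3 samples 5 seconds apart
iostat -xd sda 5 3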
Which process is causing the high I/O?
[root@test ~]# iotop
Note: iotop is not installed by default on most RHEL machines. To install:
yum install iotop
iotop sorts processes by I/O utilization, so the heaviest I/O consumers appear at the top.
If a process is waiting for I/O, you can use ps to view its process state. A “D” state (uninterruptible sleep) means it is waiting for I/O:
[root@test ~]# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
apache    1256  0.0  0.4 493136  8120 ?        D    20:24   0:00 myprocess
You can then use that process id to see its I/O stats:
[root@test ~]# cat /proc/1256/io
rchar: 600
wchar: 0
syscr: 3
syscw: 0
read_bytes: 10000
write_bytes: 10000
cancelled_write_bytes: 0
To view the files opened by a process ID, use lsof:
[root@test ~]$ lsof -p 1256
COMMAND   PID USER   FD   TYPE DEVICE  SIZE/OFF     NODE NAME
php     12107 root  cwd    DIR  253,1      4096  9439252 /root
php     12107 root  rtd    DIR  253,1      4096        2 /
php     12107 root  txt    REG  253,1   3871784 14687350 /usr/bin/php
php     12107 root   1w    REG  253,1 961407489  4221807 /var/www/html/website/tmp/logs/some_log.log
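To list every process currently stuck in the D state (rather than scanning the full ps output by eye), a small filter like this works:

# Show all processes in uninterruptible sleep (state D)
ps -eo pid,stat,comm | awk '$2 ~ /^D/'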
The Performance Troubleshooting Process
- First look at I/O wait.
- If I/O wait is high:
  - If the system is swapping: determine which processes are consuming the most memory.
  - If the system is NOT swapping: determine which processes are consuming the most I/O.
- If I/O wait is low:
  - If idle percentage is low: determine which processes are consuming the most CPU.
  - If idle percentage is high:
    - If memory is low, determine which processes are consuming the most memory.
    - Look for network issues or other problems.
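The decision tree above can be roughed out as a quick triage script. This is only a sketch, assuming the classic vmstat column layout (si=7, so=8, id=15, wa=16) and arbitrary thresholds:

# Very rough triage based on the last of two vmstat samples
vmstat 1 2 | tail -1 | awk '{
    printf "iowait: %s%%  idle: %s%%  swap in: %s  swap out: %s\n", $16, $15, $7, $8
    if ($16 > 20 && ($7 > 0 || $8 > 0)) print "High I/O wait and swapping: find the biggest memory consumers"
    else if ($16 > 20)                  print "High I/O wait, no swapping: find the biggest I/O consumers (iotop)"
    else if ($15 < 20)                  print "Low idle: find the biggest CPU consumers (top)"
    else                                print "CPU and disk look fine: check memory, network, or other issues"
}'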
Other Helpful Disk Commands
fdisk -l - partition information
/etc/fstab - System filesystems
pvdisplay - physical volume display
df -h - list all your mounted partitions along with their size
df /tmp - determine which filesystem the /tmp folder is on
du -sh - disk utilization of the current directory in the form of a summary
du -ckx | sort -n > /tmp/large_directories - Track down the largest directories
ls -lShr - list all of the files sorted by their size
sudo sh -c "> /var/log/messages" - Truncate file
Troubleshoot Historical Load Issues
sar is an excellent tool that stores historical system information. It is provided by the sysstat package on RHEL:
yum install sysstat
Configuration lives in /etc/default/sysstat (Debian/Ubuntu) or /etc/sysconfig/sysstat (RHEL).
Data is collected by the /etc/cron.d/sysstat cron job and stored in /var/log/sysstat or /var/log/sa.
sar command
sar - By default, outputs CPU statistics for the current day
sar -r - Display RAM statistics for the current day
sar -b - Display I/O information
sar -A - Output all statistics: load average, CPU load, RAM, disk I/O, network I/O and more
sar -s 20:00:00 -e 20:30:00 - Restrict output to a start and end time range
sar -f /var/log/sysstat/sa04 - Pull data from the statistics file for the fourth of the month
sar -r -s 13:00:00 -e 13:20:00 - Memory utilization for a start and end time range
sar -q -s 13:10:00 -e 13:20:00 - Run queue length and load averages for a start and end time range
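These flags can be combined. For example, memory statistics for a 20-minute window on the fourth of the month (the path here assumes the /var/log/sa layout used on RHEL):

# Historical memory statistics from a specific day and time range
sar -r -f /var/log/sa/sa04 -s 13:00:00 -e 13:20:00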