Wednesday, March 30, 2011

Disk Performance Trending in Linux:

The disk I/O system is often the slowest subsystem on a computer and one of the biggest bottlenecks in overall system performance, and disk I/O is critical for many applications.

This document summarizes the key factors that affect overall disk performance and points to some tools that can be used to measure those factors on Linux servers.

Disk Throughput

hdparm has a lot of features; the -t switch is useful for getting a general baseline on the speed of the disk system. It's simply a matter of passing it the device name of the drive you want to test.

hdparm -t /dev/sda1
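
Results vary a little from run to run, so it can help to repeat the test a few times and to compare against cached (in-memory) reads; a quick sketch, assuming /dev/sda1 is the device you care about:

for i in 1 2 3; do hdparm -t /dev/sda1; done     # buffered disk reads, three samples
hdparm -T /dev/sda1                              # cached reads, for comparison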

This first example was run against a locally attached 2TB SATA2 5400 RPM drive (no RAID or striping, just a single local disk) formatted with EXT3.

Timing buffered disk reads: 154 MB in 3.01 seconds = 51.14 MB/sec

 

This next result shows the difference when testing a 96GB SATA2 SSD on the same system (again, just a single disk, no striping), also on EXT3.

Timing buffered disk reads: 410 MB in 3.01 seconds = 136.44 MB/sec

 

Note the significant difference when talking to the SSD. The SATA2 bus itself is capable of a theoretical 3.0 Gbps (SATA3 is supposed to deliver up to 6 Gbps, but I don't have a system with that to test with yet). The fastest SCSI standard today tops out at 5.12 Gbps.

Here is the same SSD in the same system, this time running the EXT2 file system. Note that the journaling feature in EXT3 causes additional write activity that is detrimental to SSD drives and is best avoided if possible.

Timing buffered disk reads: 437 MB in 3.03 seconds = 144.22 MB/sec

 

For comparison purposes I remoted into my customer's network and ran the tool on a few of their systems.

These are the results from a few production servers at the customer site: HP ProLiant ML350 G3 servers with HP Smart Array controllers and SCSI drives in a RAID 5 array.

 

Timing buffered disk reads: 154 MB in 3.03 seconds = 50.88 MB/sec
Timing buffered disk reads: 268 MB in 3.00 seconds = 89.32 MB/sec
Timing buffered disk reads: 126 MB in 3.00 seconds = 41.99 MB/sec
Timing buffered disk reads: 142 MB in 3.01 seconds = 47.15 MB/sec
Timing buffered disk reads: 157 MB in 3.01 seconds = 52.19 MB/sec
Timing buffered disk reads: 137 MB in 3.00 seconds = 45.79 MB/sec

 

Note that the overall read performance above is in most cases close to the speed I was getting with the SATA 7200 RPM disk tested in my home lab. Below is a slight improvement after some tuning:

Timing buffered disk reads: 218 MB in 3.00 seconds = 72.59 MB/sec
Timing buffered disk reads: 277 MB in 3.03 seconds = 92.41 MB/sec
Timing buffered disk reads: 146 MB in 3.00 seconds = 48.63 MB/sec
Timing buffered disk reads: 158 MB in 3.00 seconds = 52.80 MB/sec
Timing buffered disk reads: 170 MB in 3.02 seconds = 56.27 MB/sec

 

The results below are from a personal computer running a software RAID 5 array with 4x 10,000 RPM Western Digital VelociRaptor drives. Note that the first result came from the main / partition and the second from the swap partition; it was the same array spread across the same 4 drives, but with notably different results.

 

Timing buffered disk reads: 1240 MB in 3.00 seconds = 413.26 MB/sec (/)
Timing buffered disk reads: 2326 MB in 3.00 seconds = 774.94 MB/sec (/swap)
 
With a single 10,000 RPM VelociRaptor we got the following results
Timing buffered disk reads: 356 MB in 3.02 seconds = 117.84 MB/sec

 

Oh, and... I didn't do this test myself, but a 26-disk RAID array running Samsung 256GB SSD drives was shown delivering:

Timing buffered disk reads: 6056 MB in 3.00 seconds = 2018.70 MB/sec
 
Yes, that does say over 2 GB per second...

Disk I/O Statistics

iostat (installed by default in openSUSE 11 but not in SLES10/OES2 or SLES9) reports on I/O event statistics. It is part of the sysstat package, which I think should be installed by default since iostat and sar are incredibly useful tools. You can get the package here if you don't have it on your server: http://www.novell.com/products/linuxpackages/server10/i386/sysstat.html
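
On a system where it's missing, installing it is usually a one-liner (a sketch for SLES11/openSUSE; on older releases grab the RPM from the link above instead):

zypper install sysstat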

sles11-lab1:/usr/share/doc/packages/sysstat # iostat
Linux 2.6.32.12-0.7-pae (home)   03/17/11   _i686_

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.43    0.01    0.36    0.40    0.00   97.81

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.17        12.69        26.50   56968121  118897930
sdb               0.00         0.00         0.00      15244       2810
fd0               0.00         0.00         0.00         64          0

opensuse11-lab1:/usr/share/doc/packages/sysstat # iostat -x 1 4
Linux 2.6.32.12-0.7-pae (home)   03/17/11   _i686_

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.43    0.01    0.36    0.40    0.00   97.81

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.03     2.50    0.36    0.81    12.69    26.49     33.42      0.01    9.34    3.78    0.44
sdb         0.00     0.00    0.00    0.00     0.00     0.00     18.88      0.00    6.25    4.36    0.00
fd0         0.00     0.00    0.00    0.00     0.00     0.00      8.00      0.00   34.00   34.00    0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00
sdb         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00
fd0         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00
sdb         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00
fd0         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00
sdb         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00
fd0         0.00     0.00    0.00    0.00     0.00     0.00      0.00      0.00    0.00    0.00    0.00

 

The output from iostat includes measurements for the following:

  • tps (number of transfers (I/O requests) per second)

  • Blk_read/s (number of blocks read per second from the device)

  • Blk_wrtn/s (number of blocks written per second to the device)

  • Blk_read (total # of blocks read from the device since startup)

  • Blk_wrtn (total # of blocks written to the device since startup)

  • %iowait (how much time the CPU is waiting on the disk)

Note that there are other stats on the screen but they are not specifically relevant to disk performance.

The second example above uses the command “iostat -x 1 4”. The -x switch gets “extended” stats, and the “1 4” indicates that the command gathers statistics 4 times, 1 second apart. The extended report includes the stats above plus a few others of note:

  • rrqm/s - The number of read requests merged per second that were queued to the device.

  • wrqm/s - The number of write requests merged per second that were queued to the device.

  • r/s - The number of read requests that were issued to the device per second

  • w/s - The number of write requests that were issued to the device per second.

  • rsec/s - The number of sectors read from the device per second.

  • wsec/s - The number of sectors written to the device per second.

  • avgrq-sz - The average size (in sectors) of the requests that were issued to the device.

  • avgqu-sz - The average queue length of the requests that were issued to the device.

  • await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

  • svctm - The average service time (in milliseconds) for I/O requests that were issued to the device.

  • %util - Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

Notes:

  • iostat statistics represent all disk I/O to all storage, whereas the hdparm command measures the performance of a specific device.

  • iostat reports what the system is actually doing, whereas hdparm generates its own disk traffic to measure how fast the device can go.

  • the first line of output from iostat represents stats since the server was started. Subsequent lines show the stats during the sampling interval.

  • If a disk is doing a large number of transfers (the tps field) but reading and writing only small amounts of data (the Blk_read/s and Blk_wrtn/s fields), examine how your applications are doing disk I/O. The application may be performing a large number of I/O operations to handle only a small amount of data; you may want to rewrite the application if this behavior is not necessary.
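
To watch a single device over time rather than the one-shot summary, iostat also accepts a device name and an interval; a sketch (sda and the intervals are just examples):

iostat -x sda 5                                  # extended stats for sda, refreshed every 5 seconds
iostat -x 60 60 > /tmp/iostat-$(date +%F).log    # or one sample per minute for an hour, saved for trending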

Virtual Memory Statistics

vmstat displays stats about several things, including paging and block I/O. Paging refers to the usage of the swap partition and is really an indicator of memory pressure, but it is also a factor in the use of the disk system: heavy usage of swap space will negatively affect the rest of your disk performance. So if you see high disk I/O and think it might be the problem, check whether swap I/O is also high, since that more likely means memory is the chokepoint.

 

OES2-home:/proc # vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

0 0 7276 422976 20404 299920 0 0 6 4 3 9 1 0 98 0 0

The main things to check here are:

  • si - swap in

  • so - swap out

  • bi - blocks in

  • bo - blocks out

Notes:

  • the first line of output from vmstat represents stats since the server was started. Subsequent lines show the stats during the sampling interval.

OES2-home:/var/log # vmstat 1 6

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------

r b swpd free buff cache si so bi bo in cs us sy id wa st

0 0 7276 348176 83412 311620 0 0 6 4 3 1 1 0 98 0 0

0 0 7276 348344 83412 311620 0 0 0 0 39 54 0 0 100 0 0

0 0 7276 348368 83412 311620 0 0 0 0 32 53 0 0 100 0 0

0 0 7276 348368 83412 311620 0 0 0 0 40 71 1 2 97 0 0

0 0 7276 348368 83412 311624 0 0 0 0 31 49 0 0 100 0 0

0 0 7276 348368 83412 311624 0 0 0 0 33 53 0 0 100 0 0
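
For longer-term trending it can be handy to leave vmstat sampling in the background and review the log later; a sketch (the interval, count and file path are arbitrary):

nohup vmstat 60 1440 > /tmp/vmstat-$(date +%F).log &     # one sample per minute for 24 hours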

Swappiness

Often users will notice performance degradation in their applications when the system exceeds roughly 40-50% of its RAM consumption. This is because the default swappiness setting tells the system to start using swap space (space preallocated on the hard disk) to store program memory instead of the much faster RAM. This can be seen graphically in gkrellm by watching the filled portion of the "swap" meter, which sits directly underneath the memory meter. In addition to degrading system responsiveness, the use of swap space can also greatly reduce the battery life of laptops because of the power it takes to access the hard drives.

The fact that Linux starts using swap space while physical memory is still free may seem very counter-intuitive to most users (as it did to me at first). Linux, being a server-oriented operating system, is by default tuned to deliver high performance to background applications at the expense of foreground applications. This means that your word processor, mp3 player, KDE desktop, Doom 3 video game and any other "foreground" application will start to be swapped out at the earliest sign of rising memory consumption so that the background services can run smoothly. For the average desktop user, this is almost never what you want. Short of turning swap off completely (which is not recommended), Linux gives us the ability to fine-tune how likely swap space is to be used at all.

To check the swappiness value on your system, run the following command on the terminal.

cat /proc/sys/vm/swappiness

The default is 60 on most SUSE systems. If you see swap activity on a system with plenty of RAM, it often helps to lower this to around 10.
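
A minimal sketch of changing it (the value 10 is just an example; pick what suits your workload):

sysctl -w vm.swappiness=10                        # change it on the running system
echo "vm.swappiness = 10" >> /etc/sysctl.conf     # make the change persistent across reboots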

/proc/diskstats

The /proc/diskstats file is a virtual file in a virtual subdirectory: the contents of /proc are read directly from the kernel and do not actually exist on disk. The counters in diskstats are constantly updated, and the file shows the following information about each block device (a sketch of pulling the values for a single device follows the list):

  • # of reads completed

  • # of reads merged

  • # of sectors read

  • # of milliseconds spent reading

  • # of writes completed

  • # of writes merged

  • # of sectors written

  • # of milliseconds spent writing

  • # of I/Os currently in progress

  • # of milliseconds spent doing I/Os

  • weighted # of milliseconds spent doing I/Os
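
A quick sketch of pulling the counters for a single device (sda is just an example; the field positions follow the list above, after the major, minor and device-name columns):

awk '$3 == "sda"' /proc/diskstats                                                         # raw counter line for sda
awk '$3 == "sda" { print "sectors read:", $6, "sectors written:", $10 }' /proc/diskstats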

sar

sar is installed in SLES11 but not in SLES10 and earlier; it is part of the sysstat package and can be added fairly easily. It does need to be configured to run a cron task that creates the report files in /var/log/sa (sa2 is the process that creates them); once this has run you can use sar to report from those files. Running sar -d gives output like the example below. You can download sysstat here if you don't have it on your server already: http://www.novell.com/products/linuxpackages/server10/i386/sysstat.html
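
If the cron job hasn't produced any report files yet, sar can also report live; a quick sketch (the saXX filename depends on the day of the month, so sa17 below is just an example):

sar -d 1 3                      # live device report: 3 samples, 1 second apart
sar -d -f /var/log/sa/sa17      # device stats from the saved daily file for the 17th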

 

03:47:56 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await     svctm     %util
03:47:59 PM    dev8-0     66.56   3480.79      5.30     52.38      0.21      3.08      2.79     18.54
03:47:59 PM    dev8-1      1.66     37.09      0.00     22.40      0.03     18.40      7.20      1.19
03:47:59 PM    dev8-2     64.90   3443.71      5.30     53.14      0.17      2.69      2.67     17.35
03:47:59 PM    dev8-16   460.60 117827.81     42.38    255.91      1.63      2.73      1.82     83.97
03:47:59 PM    dev8-17   460.60 117827.81     42.38    255.91      1.63      2.73      1.82     83.97
03:47:59 PM    dev11-0     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:47:59 PM    dev8-32     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:47:59 PM    dev8-33     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

  • DEV - The device in question

  • tps - Number of transfers (I/O requests) per second

  • rd_sec/s - Number of sectors (1 sector = 512 bytes) read per second

  • wr_sec/s - Number of sectors written per second

  • avgrq-sz - Average size (in sectors) of the requests issued to the device

  • avgqu-sz - Average queue length of the requests issued to the device

  • await - Average number of milliseconds I/O requests for this device had to wait before being handled, including how long it took to handle them

  • svctm - Average service time, in milliseconds, for the I/O requests issued to the device

  • %util - Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device)

Measuring Disk Latency

Seek time is the biggest performance issue on most traditional disk systems: the amount of time it takes to get the disk heads into position and then wait until a particular sector is beneath the heads. Typical seek times vary largely with the disk rotation speed:

5400 RPM    9-14 ms
7200 RPM    6.7-7.0 ms
10K RPM     4.5-5.2 ms
15K RPM     3.5-4.2 ms
SSD         <0.1 ms

The tool 'fio' has several modules for disk performance tests; the “random-read-test.fio” module, as the name suggests, does random disk reads for the purpose of measuring the latency of the disk.

'fio' is not installed by default in SLES/SLED or OES2, but it can be added using the package at http://download.opensuse.org/repositories/home:/mge1512:/benchmarking/SLE_11/i586/
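
I haven't reproduced the bundled random-read-test.fio here, but a minimal job file along the same lines looks like the sketch below (the job name, size and directory are just illustrative, and the directory must already exist):

; random-read.fio - small random-read latency test
[random-read]
rw=randread
size=128m
directory=/tmp/fio-test

Run it with "fio random-read.fio".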



“Bonnie” - included in SLES11/SLED11 and openSUSE 11.

“Iozone” - also not installed by default; you can download it here: http://download.opensuse.org/repositories/home:/mge1512:/benchmarking/SLE_11/i586/

32 bit version:
http://software.opensuse.org/search/download?base=openSUSE%3A11.4&file=benchmark%2FopenSUSE_11.4%2Fi586%2Ffio-1.41-3.1.i586.rpm&query=fio

64 bit version:
http://software.opensuse.org/search/download?base=openSUSE%3A11.4&file=benchmark%2FopenSUSE_11.4%2Fx86_64%2Ffio-1.41-3.1.x86_64.rpm&query=fio

Note that this isn't supported in SLES/SLED or OES2.

The I/O Scheduler

The 2.6 kernel introduced a concept called “I/O elevators” and provides 4 I/O schedulers:

  • CFQ (completely fair queuing)

  • NOOP

  • Anticipatory

  • Deadline

The choice of I/O scheduler can have a major impact on disk performance.

An I/O elevator is a queue where I/O requests are ordered based on where their sector is on the disk.

You can determine which scheduler you are using with dmesg (or by reading /sys/block/<device>/queue/scheduler, as in the sketch below).

Tunable parameters for the scheduler are in /sys/block/(device)/iosched.
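
A quick sketch of checking and switching the scheduler for one device (sda is just an example; the entry in brackets is the active elevator):

cat /sys/block/sda/queue/scheduler               # e.g. noop anticipatory deadline [cfq]
echo deadline > /sys/block/sda/queue/scheduler   # switch sda to the deadline scheduler at runtime
# to set it for every device at boot, add elevator=deadline to the kernel command line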



The CFQ scheduler targets “fair” allocation of I/O bandwidth among the initiators of requests.

NOOP is optimized for large I/O systems using RAID or SAN-connected storage.

Anticipatory is tuned to reduce per-thread read response time and is best suited for systems that stream large files and do synchronous disk reads. It is a poor choice for systems with random read/write workloads.

The deadline scheduler uses five I/O queues to track requests and puts a deadline on each one; it is intended to even out read-request response times for random disk workloads.

Choosing the right file system for the job...

EXT2 - This is a non-journaling file system. It is ideal for a USB or SSD drive where you need high stability with minimal writes.

EXT3 - The main difference between EXT2 and EXT3 is the addition of journaling. Journaling is meant to speed up graceful recovery after a system crash by backing out uncommitted changes. EXT3 has been the default choice for most situations for years.

EXT4 - Adds support for larger filesystems, faster checking, nanosecond timestamps and checksum verification of the journal.

ReiserFS - This is a good choice in situations where there are lots of small files, but it doesn't work well with multicore PCs as the architecture is often limited to one operation at a time.

XFS - A highly tunable file system with features like guaranteed-rate I/O, online resizing and quota enforcement, and it can (theoretically) support up to 8 exabytes of storage space.

Btrfs - Still in development. It has support for transparent compression, snapshots, cloning and "in place conversion" from ext3 or ext4. It is available now during development, but it may be premature to deploy it in production environments today.
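
Before changing anything, it's worth confirming what each mounted volume is actually formatted with; a quick check:

df -T                # shows the filesystem type for each mounted volume
cat /proc/mounts     # shows the mount options as well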



References:

Novell Documentation
SUSE Linux Enterprise Server 11 - System Analysis and Troubleshooting Guide
http://www.novell.com/documentation/sles11/pdfdoc/book_sle_tuning/book_sle_tuning.pdf

IBM Redbook
Linux on IBM System z: Performance Measurement and Tuning
http://www.redbooks.ibm.com/redbooks/pdfs/sg246926.pdf

IBM RedPaper
Tuning SUSE Linux Enterprise Server on IBM eServer xSeries Servers
http://www.redbooks.ibm.com/redpapers/pdfs/redp3862.pdf

Linux Performance and Tuning Guidelines
http://www.redbooks.ibm.com/redpapers/pdfs/redp4285.pdf

Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers
http://www.redbooks.ibm.com/redpapers/pdfs/redp3861.pdf

Two Paths to Server Performance: Matthias G. Echermann and Bill Tobey; Novell Connection Magazine, July 2010
http://www.novell.com/connectionmagazine/2010/07/tt2.pdf