How to choose the right I/O kernel scheduler


I was looking for a way to improve my system's responsiveness, knowing that I/O is a major factor that can influence it. Yes, I could replace my HDD with an affordable SSD, but I wanted something that would not cost me money. So I read a little about kernel I/O schedulers.

In Linux kernel 3.0.6 I found just three options: noop, deadline and cfq.

How would I know which one is best for my hardware and for the type of work I do on my system? It is hard to say, but you can test each of them and see which one gives you the better result.

So, the first step is to set your kernel to use one of these three I/O schedulers. The second step is to run an I/O benchmark on your hard disk using that particular scheduler.

Change kernel’s I/O scheduler

How can you check which I/O scheduler the kernel is currently using? Run the following command at the console:

cat /sys/block/sda/queue/scheduler

It will return something like:

noop deadline [cfq]

The one enclosed in square brackets is the current kernel I/O scheduler. To change the kernel I/O scheduler at runtime, run the following command at the console as root:

echo "<i/o scheduler>" > /sys/block/sda/queue/scheduler

where <i/o scheduler> is one of those that your kernel supports.
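If you want to script the two steps above, here is a minimal sketch. The function names active_scheduler and set_scheduler are my own (hypothetical) helpers, not part of any tool, and the optional third argument of set_scheduler (a sysfs root, default /sys) is an assumption I added only so that the function can be tried against a scratch directory before touching the real system.

```shell
#!/bin/bash

# Print the active scheduler from a line such as "noop deadline [cfq]".
# (hypothetical helper)
active_scheduler() {
    # Keep only the text between the square brackets
    echo "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# set_scheduler DEVICE SCHEDULER [SYSFS_ROOT]
# (hypothetical helper; SYSFS_ROOT defaults to /sys)
set_scheduler() {
    local dev="$1" sched="$2" root="${3:-/sys}"
    local file="$root/block/$dev/queue/scheduler"

    [ -w "$file" ] || { echo "cannot write $file (are you root?)" >&2; return 1; }

    # Strip the brackets and check that the requested name is on offer
    local available
    available=$(sed 's/[][]//g' "$file")
    case " $available " in
        *" $sched "*) ;;
        *) echo "$sched is not offered by $dev: $available" >&2; return 1 ;;
    esac

    echo "$sched" > "$file"
}
```

For example, active_scheduler "$(cat /sys/block/sda/queue/scheduler)" prints just the name of the current scheduler, and set_scheduler sda deadline (run as root) switches sda to deadline only if the kernel lists it.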

Run an I/O benchmark on your disk

There are many I/O benchmark tools on the net. My favourite is IOzone. Simply running the iozone -h command at your terminal gives you a comprehensive list of all available options (and there are dozens).

I found an interesting post about how to stress your hard disk and get benchmark numbers that you can load into a spreadsheet and plot as a chart, which helps you understand which of these I/O schedulers behaves better on your hardware:

https://bbs.archlinux.org/viewtopic.php?pid=969117

So I used the following bash script (e.g. iozone-scheduler), which calls IOzone sequentially and then compiles a tab-delimited log file that one can use to plot a chart:

#!/bin/bash

# Test schedulers with iozone
# See https://bbs.archlinux.org/viewtopic.php?pid=969117
# by fackamato, Aug 1, 2011
# changelog:
# 03082011
# Added: Support for Linux MD devices
# Added/fixed: take no. of threads as argument and test accordingly (big rewrite)
# 02082011
# Added: Should now output to a file with the syntax requested by graysky
# Fixed: Add support for HP RAID devices
# Fixed: Drop caches before each test run

if [ "$EUID" -ne "0" ]; then echo "Needs su, exiting"; exit 1; fi

unset ARGS;ARGS=$#
if [ ! $ARGS -lt "5" ]; then
    DEV=$1
    DIR=`echo $2 | sed 's/\/$//g'` # Remove trailing slashes from path
    OUTPUTDIR=`echo $4 | sed 's/\/$//g'` # Remove trailing slashes from path

    # Create the log file directory if it doesn't exist
    if [ ! -d "$OUTPUTDIR" ]; then mkdir -p $OUTPUTDIR;fi

    # Check the test directory
    if [ ! -d "$DIR" ]; then
        echo "Error: Is $DIR a directory?"
        exit 1
    fi

    # Check the device name
    MDDEV="md*"
    HPDEV="c?d?"
    case "$DEV" in
        $HPDEV ) # HP RAID
            unset SYSDEV;SYSDEV="/sys/block/cciss!$DEV/queue/scheduler"
            unset MD;declare -i MD;MD=0
        ;;
        $MDDEV ) # mdadm RAID
            echo "Found a Linux MD device, checking for schedulers..."
            unset MD;declare -i MD;MD=1
            unset SYSDEV
            SYSDEV=$(mdadm -D "/dev/$DEV" | grep active | awk -F '/' '{print $3}' | sed 's/[0-9]//g')
        ;;
        * )
            unset SYSDEV;SYSDEV="/sys/block/$DEV/queue/scheduler"
            unset MD;declare -i MD;MD=0
        ;;
    esac

    # Check for the output log
    unset OUTPUTLOG;OUTPUTLOG="$OUTPUTDIR/iozone-$DEV-all-results.log"
    if [ -e "$OUTPUTLOG" ]; then echo "$OUTPUTLOG exists, aborting"; exit 1;fi

    # Find available schedulers
    if [ $MD -eq 0 ]; then
        echo "not md device"
        declare -a SCHEDULERS
        SCHEDULERS=`cat $SYSDEV | sed 's/\[//g' | sed 's/\]//g'`
    else
        declare -a SCHEDULERS; unset MDMEMBER
        for MDMEMBER in ${SYSDEV[@]}; do
            unset SYSDEVMD;SYSDEVMD="/sys/block/"$MDMEMBER"/queue/scheduler"
        done
        SCHEDULERS=`cat $SYSDEVMD | sed 's/\[//g' | sed 's/\]//g'`
    fi
    if [ -z "$SCHEDULERS" ]; then
        echo "No schedulers found! Wrong device specified? Tried looking in $SYSDEV"
        exit 1
    else
        echo "Schedulers found under $DEV: "$SCHEDULERS
        SIZE=$(($3*1024)) # MiB to KiB (the default unit of iozone -s); this is the size per thread
        unset RUNS; declare -i RUNS;RUNS=$5
    fi

    # Set record size
    if [ -z "$6" ]; then
        echo "Using the default record size of 16MiB"
        RECORDSIZE="16384" # Set default to 16MB
    else
        RECORDSIZE=$6"m"
    fi

    # Set no. threads
    if [ -z "$7" ]; then
        echo "Testing with 1, 2 & 3 threads (default)"
        THREADS=3
    else
        THREADS=$7
    fi

    SHELL=`which bash`
else
    echo "# Usage:"
    echo "`basename $0` <dev name> <test dir> <test size in MiB> <log dir> <#runs> <record size> <threads>"
    echo "time `basename $0` sda /mnt 20480 /dev/shm/server1 3 16 3"
    echo "# The above command will test sda with 1, 2 & 3 threads 3 times per scheduler with 20GiB of data using"
    echo "# 16MiB record size and save logs in /dev/shm/server1/ ."
    echo "# If the record size is omitted the default of 16MiB will be used. (should be buffer size of device)"
    echo "# For HP RAID controllers use device name format c0d0 or c1d2 etc."
    exit 1
fi

function createOutputLog () {
    unset FILE
    echo -e "Test\tThroughput (KB/s)\tI/O Scheduler\tThreads\tn" > $OUTPUTLOG
    for FILE in $OUTPUTDIR/$DEV*.txt; do
        # results
        unset WRITE;unset REWRITE; unset RREAD; unset MIXED; unset RWRITE
        # Scheduler, threads, iteration
        unset SCHED;unset T; unset I;unset IT
        SCHED=`echo "$FILE" | awk -F'-' '{print $2}'`
        T=`echo "$FILE" | awk -F'-' '{print $3}' | sed 's/t//g'`
        # FIXME, it's ugly
        IT=`echo "$FILE" | awk -F'-' '{print $4}'`
        I=`expr ${IT:1:1}`

        # Get values
        WRITE=`grep "  Initial write " $FILE | awk '{print $5}'`
        REWRITE=`grep "        Rewrite " $FILE | awk '{print $4}'`
        RREAD=`grep "    Random read " $FILE | awk '{print $5}'`
        MIXED=`grep " Mixed workload " $FILE | awk '{print $5}'`
        RWRITE=`grep "   Random write " $FILE | awk '{print $5}'`
        # echo "iwrite $WRITE rwrite $REWRITE rread $RREAD mixed $MIXED random $RWRITE"

        # Print to the file
        if [ -z "$WRITE" -o -z "$REWRITE" -o -z "$RREAD" -o -z "$MIXED" -o -z "$RWRITE" ]; then
            # Something's wrong with our input file, or bug in script
            echo "BUG, unable to parse result:"
            echo "write $WRITE rewrite $REWRITE random read $RREAD mixed $MIXED random write $RWRITE"
            exit 1
        else
            echo -e "Initial write\t$WRITE\t$SCHED\t$T\t$I" >> $OUTPUTLOG
            echo -e "Rewrite\t$REWRITE\t$SCHED\t$T\t$I" >> $OUTPUTLOG
            echo -e "Random read\t$RREAD\t$SCHED\t$T\t$I" >> $OUTPUTLOG
            echo -e "Mixed workload\t$MIXED\t$SCHED\t$T\t$I" >> $OUTPUTLOG
            echo -e "Random write\t$RWRITE\t$SCHED\t$T\t$I" >> $OUTPUTLOG
        fi
    done
}

unset ITERATIONS; declare -i ITERATIONS; ITERATIONS=0
unset CURRENTTHREADS; declare -i CURRENTTHREADS
unset IOZONECMD

cd "$DIR"
echo "Using iozone at `which iozone`"

until [ "$ITERATIONS" -ge "$RUNS" ]; do
    let ITERATIONS=$ITERATIONS+1
    for SCHEDULER in $SCHEDULERS; do
        # Change the scheduler
        if [ $MD -eq 1 ]; then
            unset MEMBER
            for MEMBER in $SYSDEV; do
                echo $SCHEDULER > /sys/block/$MEMBER/queue/scheduler
            done
        else
            echo $SCHEDULER > $SYSDEV
        fi
        CURRENTTHREADS=1
        # Repeat until we've tested with all requested threads
        until [ $CURRENTTHREADS -gt $THREADS ]; do
            unset IOZONECMDAPPEND
            IOZONECMDAPPEND="$OUTPUTDIR/$DEV-$SCHEDULER-t$CURRENTTHREADS-i$ITERATIONS.txt"
            #echo "iozonecmdappend is $IOZONECMDAPPEND"
            # Append all test files to the command line (threads/processes)
            unset I; unset IOZONECMD_FILES
            for I in `seq 1 $CURRENTTHREADS`; do
                IOZONECMD_FILES="$IOZONECMD_FILES$DIR/iozone-temp-$I "
            done
            # Drop caches
            echo 3 > /proc/sys/vm/drop_caches
            echo "Testing $SCHEDULER with $CURRENTTHREADS thread(s), run #$ITERATIONS"
            IOZONECMD="iozone -R -i 0 -i 2 -i 8 -s $SIZE -r $RECORDSIZE -b $OUTPUTDIR/$DEV-$SCHEDULER-t$CURRENTTHREADS-i$ITERATIONS.xls -l 1 -u $CURRENTTHREADS -F $IOZONECMD_FILES"
            # Run the command
            echo time $IOZONECMD
            time $IOZONECMD | tee -a $IOZONECMDAPPEND
            # Done testing $CURRENTTHREADS threads/processes, increase to test one more in the loop (if applicable)
            let CURRENTTHREADS=$CURRENTTHREADS+1
        done
    done
    echo "Run #$ITERATIONS done" | tee -a $IOZONECMDAPPEND
done

echo
createOutputLog
echo "Done, logs saved in $OUTPUTDIR"
exit 0

So, to test my disk I used the following command at the terminal:

sudo iozone-scheduler sda /mnt 1024 /dev/shm 3 8 3

where:

  • sda is the name of my disk device as recognized by Linux (check your /dev/)
  • /mnt is the folder where the test files are temporarily saved
  • 1024 is the size in MB of the test file (where the I/O will run)
  • /dev/shm is the folder where the XLS spreadsheets and log files are saved
  • the first 3 is the number of runs (cycles) of the test
  • 8 is the record size in MB used for the IOzone test (I set 8 because my HDD buffer size is 8MB)
  • the last 3 is the maximum number of concurrent threads that will perform I/O operations

Well, I got my iozone-sda-all-results.log, which has the following structure (tab-delimited):

Test    Throughput (KB/s)    I/O Scheduler    Threads    n
Initial write    45654.45    cfq    1    1
Rewrite    49748.07    cfq    1    1
Random read    915288.75    cfq    1    1
Mixed workload    1243356.50    cfq    1    1
Random write    49748.07    cfq    1    1
Initial write    60800.41    cfq    1    2
Rewrite    64921.82    cfq    1    2
Random read    1242507.88    cfq    1    2
Mixed workload    1251540.75    cfq    1    2
………..
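Before plotting anything, it can help to condense that log. The following is a minimal sketch that averages the throughput for each test and scheduler over all runs and thread counts. The function name summarize_log is my own (hypothetical), and the sketch assumes the log really is tab-delimited with the five columns shown above.

```shell
#!/bin/bash

# summarize_log LOGFILE
# Prints "test <TAB> scheduler <TAB> mean throughput", sorted by test name.
# (hypothetical helper)
summarize_log() {
    awk -F'\t' '
        NR == 1 { next }                      # skip the header row
        NF >= 5 {                             # ignore anything that is not a data row
            key = $1 FS $3                    # test name + scheduler
            sum[key] += $2
            cnt[key]++
        }
        END {
            for (k in sum)
                printf "%s\t%.2f\n", k, sum[k] / cnt[k]
        }
    ' "$1" | sort
}
```

Calling summarize_log /dev/shm/iozone-sda-all-results.log would then print one line per test and scheduler.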

Using the above information I drew a chart for each of the following 5 I/O tests I ran:

  • Initial write
  • Re-write
  • Random write
  • Random read
  • Mixed workload

Sample chart based on IOzone result

I have 5 tests on 3 distinct runs, which means a total of 15 charts to plot. I am not going to post them all here (it makes no sense), but I will tell you this: sometimes cfq behaves better than deadline, which behaves better than noop; other times it is the other way round; other times it is something else again. So I got some interesting information every time. It is hard to decide which one is better than the others, because one is better at reading, another at writing, another is better in the 2nd run than in the 1st, and so on. To determine which one to choose, I approached the problem with the following naive method:

  • I compared each run individually
    • for each test I checked which of the I/O schedulers behaved better than the others
    • I gave 2 points to the best, 1 to the average and 0 to the worst one
  • I added up the points that each I/O scheduler obtained
  • the one with the most points I declared the winner.
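The scoring above can be automated roughly as follows. This is only a sketch of the idea, not the exact procedure I followed by hand: it assumes the tab-delimited log produced by the script, it averages the throughput over the thread counts within each (test, run) pair, and it gives each scheduler one point for every other scheduler it beats in that pair (so with three schedulers: 2, 1 and 0). The function name score_schedulers is hypothetical.

```shell
#!/bin/bash

# score_schedulers LOGFILE
# Prints "scheduler <TAB> points", highest score first.
# (hypothetical helper)
score_schedulers() {
    awk -F'\t' '
        NR == 1 { next }                          # skip the header row
        NF >= 5 {
            key = $1 FS $5 FS $3                  # test, run, scheduler
            sum[key] += $2; cnt[key]++
            groups[$1 FS $5] = 1                  # every (test, run) pair
            scheds[$3] = 1                        # every scheduler seen
        }
        END {
            for (g in groups)
                for (a in scheds) {
                    if (!((g FS a) in sum)) continue
                    for (b in scheds) {
                        if (!((g FS b) in sum)) continue
                        # one point for every scheduler that "a" beats
                        if (sum[g FS a] / cnt[g FS a] > sum[g FS b] / cnt[g FS b])
                            points[a]++
                    }
                }
            for (s in scheds)
                printf "%s\t%d\n", s, points[s]
        }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Ties get the same number of points under this rule, so the totals may differ slightly from a hand count.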

I tested my laptop and my desktop workstation.

On my laptop, which has a TOSHIBA MK1637GSX disk, the cfq I/O scheduler was the winner (20 points) and noop was the loser (11 points).

On my desktop, which has a WDC WD5000AAKS-60A7B2 disk, the noop I/O scheduler was the winner (18 points) and deadline was the loser (11 points). cfq came very close (15 points), but noop was a little better on all tests, so I decided to use noop on that system from now on.

Another interesting piece of information is “how much has the I/O improved by changing the kernel scheduler?”

Well, the difference is not an order of magnitude, but sometimes it is 30% better, other times just 18% or only 2%. So the difference can vary between 0 and 30%, or even more. Any improvement is welcome, so when you can get one, why not take it?
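For reference, the percentages are simply the relative difference between two throughput values. A tiny sketch of that arithmetic follows; pct_gain is a hypothetical helper name and the numbers in the usage note are made up for illustration, not taken from my results.

```shell
#!/bin/bash

# pct_gain OLD NEW
# Prints how much NEW improves on OLD, in percent (negative means worse).
# (hypothetical helper)
pct_gain() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}
```

With illustrative values, pct_gain 50000 65000 prints 30.0, that is, 65000 KB/s is 30% better than 50000 KB/s.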

How to permanently change your kernel I/O scheduler

Well, the method I prefer is to recompile the kernel, so:

  • Enable the block layer —>
    • IO schedulers —>
      • enable Deadline I/O scheduler (if you intend to use it)
      • enable CFQ I/O scheduler (if you intend to use it)
      • Default I/O schedulers —>
        • select whichever of the Deadline, CFQ or No-op schedulers fits your needs.

After you recompile and install your new kernel, your I/O operations should (hopefully) behave a little better.
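To double-check what a kernel was actually built with, you can look at its config file. The sketch below assumes a kernel of this generation, where the choices above end up as CONFIG_IOSCHED_* and CONFIG_DEFAULT_IOSCHED entries; the function names are hypothetical, and where the config file lives (for example /boot/config-$(uname -r) or /usr/src/linux/.config) depends on your distribution, so treat those paths as assumptions.

```shell
#!/bin/bash

# default_iosched CONFIG_FILE
# Prints the compiled-in default scheduler, e.g. cfq.
# (hypothetical helper)
default_iosched() {
    sed -n 's/^CONFIG_DEFAULT_IOSCHED="\(.*\)"$/\1/p' "$1"
}

# enabled_iosched CONFIG_FILE
# Lists the scheduler options that are built in (=y) or modular (=m).
# (hypothetical helper)
enabled_iosched() {
    grep -E '^CONFIG_IOSCHED_[A-Z]+=(y|m)$' "$1"
}
```

For example, default_iosched /usr/src/linux/.config prints the default you selected in the menu, provided the file is at that location.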

About Eugen Mihailescu

Always looking to learn more about *nix world, about the fundamental concepts of arithmetic, algebra and geometry. I am also passionate about programming, database and systems administration.