on 08-01-2014 10:11 PM
Good Day,
We have an IQ 15.4 ESD #3 server running on Red Hat Linux 5.5, attached to an old EMC Symmetrix SAN.
Usually during peak day hours, say 9:00am-noon and 3:00pm-5:00pm (basically all day), we have been monitoring with top and iostat, with output like the following (on average):
16 core machine
CPU %user %system %iowait
all 35.92 83.00 60.15
iostat output is attached(please see)
If you look at avgqu-sz, await, svctm and %util, we see values that, according to the Linux documentation, indicate a disk bottleneck, which would explain why top shows a low %user time and a high %system time.
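For context, this is roughly how we filter the capture for saturated devices (a sketch; it assumes %util is the last field of `iostat -dx` output in this sysstat version, and the 90% threshold is just an example):

```shell
# Print any device whose %util exceeds 90 in a 5-second extended sample.
# $NF is the last field (%util on this sysstat layout); +0 forces numeric compare.
iostat -dx 5 2 | awk '$NF+0 > 90 { print $1, "util:", $NF "%" }'
```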
There are also some missing HG indexes that should be created, but we are not sure that creating them would fully resolve the I/O problem.
Could you confirm this analysis?
Note: Regarding "svctm", "man iostat" includes the warning "Do not trust this field any more. This field will be removed in a future sysstat version." Does this mean we should ignore this parameter?
Thank you
Regards
Jose-Miguel
Very high await times (10,000 to 25,000 milliseconds) indicate that threads are not getting enough CPU time to drive the disk processing. Along with iostat, check the vmstat output and see whether many threads show up in the kernel "blocked" (b) column. If that is the case, the CPU is overloaded and you should also look at adding more CPU power to the machine.
Are these 16 cores physical CPUs, or are there fewer physical CPUs with multi-threading exposing 16 logical CPUs?
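One way to check on Linux (a sketch; /proc/cpuinfo is used here since lscpu may not be present on RHEL 5):

```shell
# Logical CPUs the OS sees:
grep -c '^processor' /proc/cpuinfo
# Physical topology: if "siblings" is greater than "cpu cores",
# hyper-threading is on and some of the 16 are logical, not physical.
grep -m1 'cpu cores' /proc/cpuinfo
grep -m1 'siblings'  /proc/cpuinfo
```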
Regards
Shashi
You definitely have a CPU resource issue, which is causing the additional problem of an I/O bottleneck.
From the vmstat output I see the following (columns: r b swpd free buff cache si so bi bo in cs us sy id wa st),
32 235 49232 1179624 972076 104000384 0 0 153760 59023 8211 56896 35 50 1 15 0
50 123 49232 1179284 972260 104001928 0 0 90284 100199 11973 48552 42 44 0 13 0
and then,
73 126 49232 1161728 976076 104003552 0 0 45075 35660 6094 24953 39 55 4 3 0
27 91 49232 1157320 976324 104005536 0 0 136411 45769 6727 35513 25 65 1 9 0
35 67 49232 1155796 976560 104008048 0 0 156995 46076 5838 35792 19 70 1 10 0
27 113 49232 1157060 976816 104008336 0 0 65365 48806 6049 31659 29 63 1 7 0
29 126 49232 1157476 977036 104007464 0 0 34984 69843 6330 28955 19 69 4 7 0
48 93 49232 1158464 977280 104008608 0 0 56310 49762 6776 34677 25 67 3 6 0
12 131 49232 1151384 977512 104010960 0 0 86959 92209 6479 47788 36 48 3 13 0
35 129 49232 1152320 977784 104012144 0 0 93003 21899 7592 67730 40 32 7 22 0
24 76 49232 1149172 978020 104012856 0 0 47984 62455 5677 31412 34 60 2 3 0
29 23 49232 1150272 978212 104013928 0 0 92435 56222 7290 26770 50 46 1 2 0
As you have 16 CPUs and the blocked threads number in the hundreds, that is what is causing the issue. Most likely IQ, or some other process (if another application runs on this server), is doing CPU-intensive as well as I/O-intensive work, and you do not have enough CPU power to drive the I/O. Even if you reduce iqnumbercpus in the .cfg file, it will only restrict the optimizer, not the threads.
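For reference, iqnumbercpus goes in the server's .cfg startup file; an illustrative fragment (the value shown is only an example):

```
## params.cfg fragment (illustrative) -- caps the CPU count the IQ
## optimizer plans for; it does NOT cap the number of OS threads.
-iqnumbercpus 16
```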
I also see this pattern at intermittent intervals in vmstat, indicating the I/O is random as well. As you also have extremely high I/O wait, the FIRST thing you should do is check disk speeds at the OS level, and then the controllers.
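A quick way to sanity-check raw sequential disk speed at the OS level (a sketch; the path is a placeholder -- point it at a scratch file on the SAN filesystem backing the IQ store, and note this generates real I/O load):

```shell
# Write then read back 256 MB on the SAN volume; dd reports MB/s at the end.
TESTFILE=/iqstore/dd_speed_test.bin            # placeholder path
dd if=/dev/zero of="$TESTFILE" bs=1M count=256 conv=fsync   # write speed
dd if="$TESTFILE" of=/dev/null bs=1M                        # read speed
rm -f "$TESTFILE"
```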
What is your "number of users" setting and how many users are actually active at peak time on the IQ server?
The vmstat script below also adds a timestamp to each sample, which will help you correlate entries with the .iqmsg file.
cat vmstat.scr
#!/bin/sh
# Set this to an existing directory where the log files should be written.
DIR_NAME=SET_THIS_AS_PER_YOUR_ENVIRONMENT_TO_COLLECT_DATA
MON=`date +%m`
DAY=`date +%d`
Hour=`date +%H`
MIN=`date +%M`
SEC=`date +%S`
PLATFORM=`uname -s`
export PLATFORM SEC MIN Hour DAY MON DIR_NAME
LOGFILE=$DIR_NAME/vmstat$MON$DAY$Hour$MIN$SEC.log
export LOGFILE
# Write the two vmstat header lines to the log once
# (AIX vmstat prints extra header lines, hence the different sed offsets).
vmstat 2 2 > $DIR_NAME/@@vmstat.log
if [ $PLATFORM != "AIX" ]
then
    echo `date +%m:%d:%H:%M:%S` `sed 3d $DIR_NAME/@@vmstat.log | sed 2,3d` >> $LOGFILE
else
    echo `date +%m:%d:%H:%M:%S` `sed 7d $DIR_NAME/@@vmstat.log | sed 6,7d` >> $LOGFILE
fi
if [ $PLATFORM != "AIX" ]
then
    echo `date +%m:%d:%H:%M:%S` `sed 3d $DIR_NAME/@@vmstat.log | sed 1d | sed 2d` >> $LOGFILE
else
    echo `date +%m:%d:%H:%M:%S` `sed 7d $DIR_NAME/@@vmstat.log | sed 1,5d | sed 2d` >> $LOGFILE
fi
# Loop forever: keep only the second (interval) sample from each run
# and prepend a timestamp so entries can be matched against .iqmsg.
while true
do
    vmstat 2 2 > $DIR_NAME/@@vmstat.log
    if [ $PLATFORM != "AIX" ]
    then
        echo `date +%m:%d:%H:%M:%S` `sed 1,3d $DIR_NAME/@@vmstat.log` >> $LOGFILE
    else
        echo `date +%m:%d:%H:%M:%S` `sed 1,7d $DIR_NAME/@@vmstat.log` >> $LOGFILE
    fi
    sleep 2
done
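To run the collector across a whole peak window, it can be launched detached from the terminal, for example (the data directory path is only an example; DIR_NAME inside vmstat.scr must point at it):

```shell
# Start the collector so it survives logout, and record its PID
# so it can be stopped cleanly later with kill.
mkdir -p /var/tmp/vmstat_data                    # example data directory
nohup sh vmstat.scr > /dev/null 2>&1 &
echo $! > /var/tmp/vmstat_data/vmstat.pid        # PID for a later kill
```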
Shashi,
-gm is at 100; at peak hours we are usually at 100% of user connections. Anyway, this weekend we're moving to new storage (HP 3PAR). We ran a full DB backup on the new SAN and it took ~10 hours for 6.5 TB, versus 30 hours on the old EMC Symmetrix.
Also, as you pointed out, there is a Linux process running periodically called "SPAZIO MFT/S Managed and Secure File Transfer", which we assume could be causing some of the I/O problems and CPU contention. Do you know anything about this process?
Thank you very much for the vmstat script!
Regards
Jose-Miguel
First you will need to resolve the disk contention issue; adding HG indexes will not eliminate disk contention.