Heavy I/O Busy after ASE upgraded from 15.5 to ASE 16.0 SP02 PL02

Former Member

Version : Adaptive Server Enterprise/16.0 SP02 PL02/EBF 25318 SMP/P/RS6000/AIX 7.1/ase160sp02plx/2492/64-bit/FBO/Sun Nov 22 18:44:08 2015

OS : AIX xxxxxx 1 7 00CDEF534C00

We have upgraded ASE from 15.5 to ASE 16.

As per the sp_sysmon report, after the upgrade the "I/O Busy" figure under Kernel Utilization has increased roughly five-fold compared to version 15.5 over the same duration.

Is this a known issue? Most of the time the report now shows high I/O Busy.

sp_sysmon in ASE 15.5: 12.1% (10-20% average before the upgrade). It only used to go high during database backups.

Engine Busy Utilization        CPU Busy   I/O Busy       Idle

  ------------------------       --------   --------   --------

    Engine 0                       39.2 %     23.0 %     37.9 %

    Engine 1                       39.4 %     19.4 %     41.2 %

    Engine 2                       38.4 %     15.9 %     45.7 %

    Engine 3                       38.7 %     15.7 %     45.6 %

    Engine 4                       37.6 %     16.4 %     46.0 %

    Engine 5                       34.8 %     16.7 %     48.5 %

    Engine 6                       36.9 %     17.4 %     45.8 %

    Engine 7                       34.7 %     13.4 %     51.9 %

    Engine 8                       37.7 %     16.7 %     45.6 %

    Engine 9                       33.2 %     19.2 %     47.6 %

    Engine 10                      40.7 %      8.1 %     51.2 %

  ------------------------       --------   --------   --------

  Summary           Total         411.2 %    181.8 %    907.0 %

                  Average          27.4 %     12.1 %     60.5 %

ASE 16: 72.2% average I/O Busy - most of the time it is 70-90%.

Engine Utilization (Tick %)   User Busy   System Busy    I/O Busy        Idle

-------------------------  ------------  ------------  ----------  ----------

  ThreadPool : syb_default_pool

   Engine 0                        26.7 %         0.2 %      72.8 %       0.3 %

   Engine 1                        32.8 %         0.0 %      67.1 %       0.2 %

   Engine 2                        25.4 %         0.5 %      73.8 %       0.3 %

   Engine 3                        28.2 %         0.5 %      71.2 %       0.2 %

   Engine 4                        27.5 %         0.2 %      72.0 %       0.3 %

   Engine 5                        31.5 %         0.0 %      68.5 %       0.0 %

   Engine 6                        30.1 %         0.2 %      69.7 %       0.0 %

   Engine 7                        26.7 %         0.3 %      72.7 %       0.3 %

   Engine 8                        28.7 %         0.3 %      70.8 %       0.2 %

   Engine 9                        23.7 %         0.3 %      75.6 %       0.3 %

   Engine 10                       27.8 %         0.3 %      71.8 %       0.0 %

   Engine 11                       25.5 %         0.5 %      73.8 %       0.2 %

   Engine 12                       29.3 %         0.5 %      70.0 %       0.2 %

   Engine 13                       23.4 %         0.0 %      76.6 %       0.0 %

   Engine 14                       24.5 %         0.0 %      75.3 %       0.2 %

   Engine 15                       25.9 %         0.5 %      73.6 %       0.0 %

  -------------------------  ------------  ------------  ----------  ----------

  Pool Summary        Total       437.7 %         4.3 %    1155.4 %       2.6 %

                    Average        27.4 %         0.3 %      72.2 %       0.2 %

Thanks

Ajay Pandey

Accepted Solutions (0)

Answers (4)


Former Member

https://scn.sap.com/thread/3764504

Can you go to this link? It seems there is a change request (CR #757246) to fix this in the sysmon report.

simon_ogden
Participant

There may be a CR but I don't expect it will be 'fixed'; it isn't an sp_sysmon issue. If anything they could remove that column completely - it would cause less confusion.

Former Member

Hi

Check this query

select top 10 * from master..monSpinlockActivity order by Contention desc

and read

Spinlocks and CPU usage in SAP ASE - SAP ASE - SCN Wiki
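Note: monSpinlockActivity only returns data once spinlock monitoring is switched on. A minimal sketch, assuming the standard configuration parameter name and a session that holds mon_role:

-- Spinlock monitoring is off by default; enabling it adds a small overhead
exec sp_configure "enable spinlock monitoring", 1
go
-- Then look for the hottest spinlocks
select top 10 * from master..monSpinlockActivity order by Contention desc
go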

Hans

Former Member

Spinlock monitoring is not configured. But from other MDA tables (monProcess and monWaitEventInfo) we are seeing a lot of processes waiting on the events below; not sure whether this is a spinlock or a cache issue.

  • wait for mass read to finish when getting page
  • waiting for regular buffer read to complete

Event 124: wait for mass read to finish when getting page

Event 124 occurs when a process attempts to perform a physical read but another process has already performed the read request (this also counts as a “cache miss”).

Event 29: waiting for regular buffer read to complete

A wait caused by a physical read (most likely a cache miss), which occurs when Adaptive Server does not find a page in the data cache and must read it from disk. The number of Waits is the number of physical reads that occurred because of a cache miss. Use the monSysWaits.WaitTime value to derive I/O response times.
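To rank these events server-wide, a query along the following lines can help. This is a sketch assuming the documented monSysWaits / monWaitEventInfo columns; check the MDA documentation for your release for the WaitTime units:

select top 10 w.WaitEventID, i.Description, w.Waits, w.WaitTime
from master..monSysWaits w, master..monWaitEventInfo i
where w.WaitEventID = i.WaitEventID
order by w.WaitTime desc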

former_member188958
Active Contributor

Before you get too worried about the increased physical io, have you compared your overall throughput? If throughput has greatly improved, increased physical io may not be a bad thing - it could indicate that ASE has become more efficient at processing queries, so it is processing more transactions per minute and simply hitting more cases per minute where a physical io is needed.
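One rough way to compare throughput is to diff a cumulative transaction counter over a fixed interval. A sketch, assuming monState exposes a cumulative Transactions counter as described in the MDA documentation:

-- Snapshot, wait one minute, then diff to get transactions per minute
declare @tx int
select @tx = Transactions from master..monState
waitfor delay '00:01:00'
select Transactions - @tx as TxPerMinute from master..monState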

An increase in the two wait events listed above seems consistent with an increased rate of physical io.

-bret

c_baker
Employee

Also - don't forget that you have upgraded from a process-mode kernel to a threaded kernel (I hope you are using this kernel - if not, why not?).

In that case, there are different thread pools responsible for IO.
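To confirm which kernel the server is running, the read-only configuration parameter can be checked (assuming ASE 15.7 or later):

exec sp_configure "kernel mode"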

Chris

simon_ogden
Participant

Hi Ajay,

Forget the i/o busy comparison to the earlier version; it is meaningless on the threaded kernel.

A few years ago I tried to come up with a way of massaging the i/o busy percentage into a useful metric for determining whether a server is i/o bound. The conclusion was you simply can't: there was no predictable correlation between the amount of i/o and the legacy i/o busy percentages.

There is no concept of a user engine thread being i/o busy (that is apart from synchronous i/o which you almost certainly won't be using).

You need to look at the number of i/os over a given period, and at whether any polls are returning the maximum number of events. These, combined with the thread utilisation of the DiskController, are the key metrics.

If you compare the actual number of i/os over a given period, that will give you something more comparable.
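For example, one way to count actual i/os over a fixed period is to diff the cumulative counters in monDeviceIO. A sketch, assuming the documented column names and a session with mon_role:

-- Snapshot the per-device counters, wait a minute, then diff
select LogicalName, Reads, APFReads, Writes
into #io_t0
from master..monDeviceIO
waitfor delay '00:01:00'
select t1.LogicalName,
       t1.Reads - t0.Reads as Reads,
       t1.Writes - t0.Writes as Writes
from master..monDeviceIO t1, #io_t0 t0
where t1.LogicalName = t0.LogicalName
order by 2 desc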

In practice, out of our 100 or so production servers, only 2 of them have very small periods where an extra disk i/o thread is needed.

This is from our wiki site.

  • Disk I/O is processed via the DiskController task; each one (you can run more than one) has a thread in syb_system_pool.
  • @@io_busy, i.e. engine ticks that report i/o, means there was at least one i/o being waited on. It is NOT the same measurement as @@io_busy on 12.5.4. Any engine that is idle at tick measurement time (i.e. in the signal handler loop) will be classed as i/o busy if there is any I/O outstanding (see the sketch after this list).
  • I/O busy for an engine has nothing to do with how much I/O that engine is 'responsible' for; all threads can run any work at the point an I/O is returned.
  • The DiskController thread is entirely dedicated to polling for I/O completions.
  • An actual I/O is still issued by a task that would be running on an engine thread at a particular instant (which on the last i/o may have been a different thread), but it is key to understand that this 'source' thread is irrelevant: it gives no indication of how busy a particular engine is with i/o, as when an i/o completes the task can be scheduled to run on any other engine thread. It is very important to grasp that i/o busy is now a measurement against the engine thread pool in its entirety, i.e. the i/o busy of the logical engine group (sort of, but not linearly).
  • One DiskController thread can process 200,000 i/os per second (according to Sybase, on Xeon x64); see the analysis below, which confirms it is very hard to generate enough I/O to overload a single disk task.
  • High @@io_busy DOES NOT mean more disk i/o threads are needed. It could mean:
    1. Poor disk i/o speed.
    2. The workload is entirely i/o specific and as such i/o bound, i.e. there's nothing more for the engines to do apart from wait for i/o to finish in order to continue work.
    3. If the thread utilisation is high then it should follow that it is likely returning a high number of events per poll, or a number of polls are returning the maximum number of events. This is the only circumstance where an additional disk i/o task might help.
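As an illustration of the tick counters mentioned above, the legacy globals can be diffed over an interval and converted to percentages (a sketch; the raw values are machine ticks, but the units cancel out in the ratios):

-- Sample twice, 30 seconds apart, and express each bucket as a percentage
declare @cpu int, @io int, @idle int
select @cpu = @@cpu_busy, @io = @@io_busy, @idle = @@idle
waitfor delay '00:00:30'
select (@@cpu_busy - @cpu) * 100.0 / ((@@cpu_busy - @cpu) + (@@io_busy - @io) + (@@idle - @idle)) as cpu_pct,
       (@@io_busy - @io)   * 100.0 / ((@@cpu_busy - @cpu) + (@@io_busy - @io) + (@@idle - @idle)) as io_pct,
       (@@idle - @idle)    * 100.0 / ((@@cpu_busy - @cpu) + (@@io_busy - @io) + (@@idle - @idle)) as idle_pct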

And a little more:

...

This method is still sound, but there is a key difference when it comes to @@io_busy. In the old process-mode kernel, I/O was issued by an engine itself; therefore, when the OS sent an interrupt every tick, an engine would register as i/o busy if it was in the scheduler and had i/o pending. This i/o busy was counted against that engine (as it should be, of course).

So say you had 10 engines and you were capturing 10 seconds' worth of activity - that's '100 ticks' per engine. If engines 1 and 2 registered i/o for 98 of those ticks each and cpu busy for 2 each, with no idle, and all other engines were 50% cpu, 50% idle, then this would equate to:-

10 x 100 ticks = 1000 ticks to be accounted for.

196 ticks i/o
404 ticks cpu busy
400 ticks idle

So over the period, we have 19.6% io busy, 40.4% cpu, 40% idle

The i/os in the threaded kernel, whilst initially issued by an engine thread, will actually be returned via the disk i/o thread in syb_system_pool. For this reason the concept of i/o busy for an engine thread is a misnomer.

The real key point that needs to be understood with threaded kernel diagnostics is that if there is ANY outstanding i/o being waited on by the i/o thread, then ANY engine that is 'officially' idle will be reported as i/o busy. The theory is that the alternative would be to report zero i/o busy, which would arguably have been worse: those outstanding i/os could be for any task, which could be placed on any engine's run queue, so what it is saying is that the engine is only idle because the i/o hasn't come back yet.

Anyway, what this means in terms of diagnostics is that for the situation above the io_busy could report significantly higher.

Let's say there was outstanding i/o on three quarters of the ticks.

Now, the 2 engines that before were 98% i/o busy are in theory now 98% idle, but of course they wouldn't be, as they would now likely be doing a little more cpu work, since the task(s) requiring the i/o can be processed anywhere.
So, let's say they are now (this is an entirely arbitrary figure for the purposes of illustration) 50% cpu. That leaves 50% idle, but we have i/o outstanding 75% of the time, so that's 37.5% i/o busy and 12.5% idle.
The other 8 engines were 50% idle, so they are now 50% cpu, 12.5% idle, 37.5% i/o (as the other engines are).

So now, we have:-

500 ticks cpu busy
375 ticks i/o
125 ticks idle

So now the overall utilisation is:-

37.5% i/o busy, 50% cpu, and 12.5% idle.

If you see high I/O busy for engine utilisation via sp_sysmon, or indeed via the global variables, then yes, the indication is that the task or tasks are still heavily i/o bound. However, if there is a bottleneck, you need to dig a little deeper to see where it is.

former_member182259
Contributor

Actually, your wiki needs some correction - engines *still* submit the aio_read()/aio_write() calls.   The only thing the engines do not do is the AIO polling for completion and subsequent run queue manipulation.  That is what the disk task does - and the only thing it does.

IO Busy being reported by the engines is the relative percentage of time spent doing the IO calls to issue the reads/writes. There can be a number of factors that influence this - the famous Solaris bug was one such factor that could drive IO busy to 100%. We have seen others - in particular when folks insist on using journaled file systems in which the journaling covers more than the metadata (journaled file systems for DBMS storage are useless), or when the file system isn't tuned very well (e.g. we have seen issues on AIX without CIO being forced, or when the ASE page size is smaller than the filesystem block size). Linux is especially atrocious, as the IO scheduler can get involved... and its rather lightweight, desktop-tuned IO subsystem can be rather anemic unless aggressively tuned.

Having said all of that, CPU%'s in monEngine and sp_sysmon (engine area) are computed and can be a bit off due to the fuzziness of trying to interpret values from the ticks. monThread is the best way to go, but it lacks ticks for IO - merely busy, idle, sleep.

As far as some of the other points go, WaitEvents are not spinlocks - so high WaitEvents around normal reads (29) and APFs (124) would be in line with the IO busy percent.

What I would suggest is that person investigate the following (a quick SQL-level check for the first two points follows the list):

1) device attributes (make sure dsync is OFF and directio is ON)

2) check if using raw devices or filesystems.   In either case, make sure the OS tuning is done with respect to the number of AIOs permitted, fastpath enabled, etc.

3) If filesystems, make sure that:

     a) filesystem block size is smaller or same size as ASE page size

     b) file system is mounted with force cio option

     c) check on JFS tuning - normally JFS on AIX is not an issue other than (a) and (b) but sysadmins might have done something weird

     d) check HW - there is an issue with AIX when LPARs are using VIO servers and the VIO has iSCSI enabled (I think it was this) and no iSCSI is in use...
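For points 1 and 2, a quick check from SQL (sp_helpdevice reports dsync/directio in its description column; cntrltype = 0 filters database devices from dump devices):

exec sp_helpdevice
go
select name, phyname from master..sysdevices where cntrltype = 0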

However, I also agree with Marcin - this could be a problem of a lot more query issues and resulting cache misses due to not using indexes or using different indexes than previously.

simon_ogden
Participant

"Actually, your wiki needs some correction - engines *still* submit the aio_read()/aio_write() calls.   The only thing the engines do not do is the AIO polling for completion and subsequent run queue manipulation.  That is what the disk task does - and the only thing it does"

No real modification is needed with respect to that; I have made it quite clear, as you can see from these sections taken directly from my post. I am not sure how much more closely this could agree with what you've just said 🙂


The DiskController thread is entirely dedicated to polling for I/O completions.


The i/os in the threaded kernel, whilst initially issued by an engine thread, will actually be returned via the disk i/o thread in syb_system_pool.


An actual I/O is still issued by a task that would be running on an engine thread at a particular instant (which on the last i/o may have been a different thread), but it is key to understand that this 'source' thread is irrelevant: it gives no indication of how busy a particular engine is with i/o, as when an i/o completes the task can be scheduled to run on any other engine thread.

former_member182259
Contributor

I guess it was the starting aspect that said "Disk I/O is processed via the DiskController..." that is a bit misleading. It is corrected later, but unless you are already aware of the distinction, someone can be a bit confused. I only mention this because I have seen it a lot: despite it being made clear later in docs, etc., the earlier mentions of "disk I/O being done by the DiskController" seem to be the only thing that sticks in people's minds. As a result, I always start out by making it clear that the engines actually submit the IOs and the only thing the disk task does is the polling - and try to avoid any statements about disk tasks "handling" the IOs.

It made sense from an engineering perspective to stress the disk task, as read/write requests should normally take minuscule time whereas the disk polling was always a heavier load (especially with non-blocking async polling vs. the threaded kernel's blocking poll).

However, given that the individual is showing 100% cpu (30% user and 70% IO), unlike others I don't believe that there is no problem and that this is just normal behavior for the threaded kernel. I think it needs looking into. The first thing I would do would be to get a real picture of true CPU usage from monThread. If it shows 100%, then there definitely is an issue. If it shows 50%, then there is an issue with the way IO Busy is being computed for engines in that person's release.
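For reference, monThread can be queried directly from a session with mon_role; a minimal starting point (the exact column set varies by release, hence no projection list):

select * from master..monThread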

simon_ogden
Participant

I think maybe you misread his sp_sysmon output due to the columns not being aligned; it is 27% CPU and 72% I/O. Not sure where 100% CPU has come from?

It is quite common (in fact fairly standard) for the idle figure to be very low now compared to the old process mode; it is just the inverse of the whole problem with the i/o busy.

former_member89972
Active Contributor

Simon

What/which wiki site/page are you referring to?

Can you please update your reply with a link to that wiki site/page?

TIA

Avinash

simon_ogden
Participant

It is our wiki within my organisation, nothing public.

former_member182259
Contributor

Not sure that I did - User% + IO% = 100% in his case. Don't confuse the 15.5 headings, where CPU is listed (when it should be User); the 15.7 headings in his case were quite clear - User% was ~30% and IO% was ~70%, and that means 100% CPU.

simon_ogden
Participant

We'll end up with the bad formatting again, but I don't think you are correct, unless we are talking about different things. We are talking here about engine utilisation, aren't we?

When an engine is idle and there is any outstanding i/o, this will be registered as i/o busy across all engines. It isn't the relative percentage of time spent actually issuing the read/write calls; that is the whole situation I explained above. The 'new' idle is effectively idle + i/o busy. For 95% of our business day, User + I/O will be approaching 100%; this is expected for a system that does any reasonable amount of I/O and cpu-bound work together.

marcin_najs
Explorer

Hi.

I think sysmon is not good enough to compare performance between ASE versions.

You should check in the monitoring (MDA) tables which processes perform the most physical I/O operations and find the corresponding SQL / execution plans. Maybe it is an optimizer issue or just a bad query plan... Did you modify any configuration parameters between 15.5 and 16.0 (especially those responsible for optimizer behaviour, e.g. "optimizer level")?
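A sketch of both checks, assuming the usual MDA column names (the first query needs mon_role):

-- Top sessions by physical I/O since they connected
select top 10 SPID, KPID, CPUTime, WaitTime, PhysicalReads, LogicalReads
from master..monProcessActivity
order by PhysicalReads desc
go
-- Compare the optimizer level with what 15.5 used
exec sp_configure "optimizer level"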

Best Regards.

--

Marcin Najs

former_member89972
Active Contributor

Did you change the kernel mode from "process" in 15.5 to "threaded" in 16.0?

If so, the comparison of the 15.5 sysmon vis-a-vis the 16.0 sysmon may not be as expected.

Check the Performance and Tuning manual section(s) on sp_sysmon for the threaded kernel mode and start interpreting the figures in that light.

HTH

Avinash