Solved: [Architecture] Question about performance - SAP + Oracle on Linux (single node, non RAC)

Former Member · ‎11-22-2013

Hello,

I have one SAP system on Oracle: a single Oracle instance without ASM. Averaged over a two-month period it handles about 17 000 user calls per second, and daily around 2 million dialog steps plus 1.5 million RFC + UPD + BGD steps combined.

We follow the SAP recommendations for DB parameters and tune expensive SQL statements. In the top 20 SELECTs sorted by DB time, none reads more than 5 blocks per row. I'm aware that this is not the only criterion; I mention it to show that the system is being tuned from the expensive-SQL perspective (not every statement is tuned, but all of the top resource-consuming SQLs in the cursor cache look fine). I'm also aware that expensive SQLs very often cause unnecessary load.

Our storage vendor ran checks and claims that there are no hot spots on the storage and that it has enough free resources to handle twice the current load (IOPS). I would like to know where this configuration could have bottlenecks. Sometimes (especially when the load on the DB is higher than usual) I see queues and wait times on the dm* devices. The configuration looks like this:

Linux SLES 11 SP1. LUNs configured from OS to storage:

Oracle parameters align with SAP recommendations (as mentioned above)
Filesystem ext3 with mount options: rw,noatime,nodiratime,barrier=0 (5 sapdata LUN/fs - 2TB each)
LVM:
lvm2, Metadata Areas 1, Clustered yes, PE Size 4.00 MB
Linux multipathing is used with configuration:
features='1 queue_if_no_path' hwhandler='1 alua' wp=rw policy='round-robin 0'
Then lower:
FC network (2 cards in server, 2 FC switches, 2 Controllers in storage array, storage array 160+ spindle disks + 12 ssd for tiering - no significant waits visible from storage array statistics).

Please let me know if you have experience with a similar load and an Oracle on Linux architecture, what configuration you use, and what you would improve in the setup above (please give hints based on your experience).

Thanks in advance,

Marek

fabian_herschel · ‎11-27-2013

Just a question: are you still on SLES 11 SP1? Is there any reason not to update to SP2 or to the newer, current SLES 11 SP3?

Besides that, do you really have performance issues, or are you trying to tune the system (which could be a different approach)?

One good option you have already taken is switching off the barriers for ext3 on SLES 11 SP1, as they really can slow down the I/O.

stefan_koehler · ‎01-07-2014

Hi Marek,

I just spotted this thread after you replied yesterday. It seems that a lot of things are being mixed together without considering the "right metrics" and requirements.

> in top 20 selects sorted by DB time … but it is to show that system is being tuned

Oracle DB Time <> DB Response Time (from client perspective). So the described approach is basically used to reduce the CPU or I/O load on the database server and not for "performance tuning" (e.g. PX can produce much more load, but may be faster in consequence). I have written a blog post about this topic and its misconception here:

> 17 000 user calls per second and daily around 2 million dialog steps + 1,5 million of RFC+UPD+BGD steps together.

You mentioned I/O queues and possible bottlenecks, but the key metrics you mentioned have no relation to them (e.g. one database FETCH call can issue 50,000 I/Os or no I/O at all, or think about DB backups, etc.). So you cannot correlate user calls per second with a specific CPU or I/O load.

> During high workload load average is between 40 and 50 … Regarding CPU - current 12 CPU cores are still enough - but we are going to replace hosts due to growing business and as during problematic situation system has been like a snow ball with waiting for I/O and normal CPU processing. Except problems due to bugs (SQL code, storage firmware bug) we haven't performance issues. We expect to have load twice higher during next year - thats the reason why I'm aware of performance.

So you have a specific issue like "a snow ball with waiting for I/O and normal CPU processing". Such issues can be profiled and drilled down very well with a tool called "perf" in newer kernel releases. I would try to figure out the root cause first, before switching anything based only on assumptions (if I have understood you correctly).
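
As a coarse first check before reaching for perf, the queueing on the dm* devices mentioned in the question can be quantified from two snapshots of /proc/diskstats. The following is only a minimal Python sketch under stated assumptions: the function names are invented for illustration, and it assumes the standard /proc/diskstats layout (major, minor, name, then reads completed, reads merged, sectors read, ms reading, writes completed, writes merged, sectors written, ms writing, ...). It computes the average time per completed I/O in the same spirit as the "await" column of iostat.

```python
def parse_diskstats(text):
    """Map device name -> (completed I/Os, ms spent on I/O) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        name = fields[2]
        reads, read_ms = int(fields[3]), int(fields[6])
        writes, write_ms = int(fields[7]), int(fields[10])
        stats[name] = (reads + writes, read_ms + write_ms)
    return stats


def await_ms(before, after, prefix="dm-"):
    """Average ms per completed I/O between two snapshots, for devices matching prefix."""
    result = {}
    for name, (ios_after, ms_after) in after.items():
        if not name.startswith(prefix) or name not in before:
            continue
        ios_before, ms_before = before[name]
        delta_ios = ios_after - ios_before
        # No completed I/O in the interval: report 0.0 instead of dividing by zero.
        result[name] = (ms_after - ms_before) / delta_ios if delta_ios > 0 else 0.0
    return result
```

In practice one would read /proc/diskstats, sleep for a fixed interval, read it again, and pass both parsed snapshots to `await_ms`. Comparing these values with the latencies the storage array reports helps to show whether the time is lost in the host stack (LVM, multipath, FC) or in the array.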

> Second thing with architecture is that I would avoid going to the RAC as system is not so big yet

Absolutely. If you can handle the current workload with 12 CPUs, there is no reason at all to switch to RAC (for scale-out reasons). Today's x86 commodity hardware can scale up far beyond your current hardware configuration. You have to generate a tremendous amount of load before you need scale-out solutions like RAC. That's it from a load perspective; however, there may be other reasons for switching to RAC depending on the business requirements (OS upgrade with no downtime, rolling upgrades, consolidation, etc.). It is also very easy to degrade performance with RAC if you do not look carefully (e.g. cache fusion with 3-way block gets in the worst case, etc.).

> Does anyone know any bigger SAP installations on ASM (w/o RAC or with RAC) on Linux (not on Exadata) - if yes please - post pros and cons.

Any bigger than what? You have provided no IOPs values or anything like that. Yes, i know / have seen a lot of "large" mission critical systems on ASM. "Pros" and "Cons" depend on the infrastructure, the used features and the internal knowledge (of each team member).

> Does anyone used ASM with SLES HA ?

Yes, you can use SLES HA (Pacemaker) with Oracle Restart (as ASM is part of the GI stack nowadays) and handle the ASM disk groups with an HA script (as a disk group needs to be mounted exclusively by non-clustered ASM). As a consequence, each database needs its own ASM disk group(s), and you need to test this very carefully.

> As I have other systems with around 1 milion dialog steps per day working on vSphere 5 without issues. Please let me know if you have experience with ERP systems 2 milions (and average more than 15 000 user calls in Oracle per second) or more dialog steps per day on VMWare.

You are looking at the wrong key metrics again. The number of dialog steps has no correlation with the important factors like CPU or I/O load. You need to measure the (OS-based) load of your current system and decide on the basis of that data whether it is possible. VMware has limitations, and some of them amplify existing performance issues (e.g. vCPUs and their scheduling in case of allocation). I have built high-performance Oracle HA infrastructures on VMware (e.g. with RDM, etc.) and they run well, but only if you do not hit some specific limits / constellations.

I highly recommend that you collect the required data first and make your decisions based on that (and not on things like database calls or dialog steps):

Gather necessary raw data with nmon
Create report (from raw data) with nmon analyzer and interpret it
Get SAPS specs of your current hardware
Calculate growth and the resources needed depending on your requirements, and adjust the "rule of thumb" formula (see below) with your measured values. You are pretty lucky, as you can measure your "real world values", so you can easily adjust the SAPS-to-I/O ratio based on your system load and scale it.

All of the following "rule of thumb" values are based on unicode systems:

2.5 SAPS (DB + SAP App Server) generate 1 I/O operation per second
- 1 concurrent SAP NW user generates ~9 I/Os per second
0.3 DB-Server (DB-centric) SAPS generate 1 I/O operation per second
- Variation of DB-SAPS : App-SAPS is significant for different SAP modules e.g., 1:3 for ERP = OK versus 1:15 for CRM = too high I/O load for DB-Server

… or …

0.4 I/O per second / SAPS (ERP, PI)
0.6 I/O per second / SAPS (BW)
0.2 I/O per second / SAPS (assumption for SAP Components with low I/O requirements, e.g. EP, CRM, SolMan)
1 concurrent active ERP user= 10 SAPS
1 concurrent active CRM, BW user = 20 SAPS
Complex application mix (incl. EP, PI, BW, ERP)= 22 SAPS
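
The second set of ratios above can be turned into a small calculation. This is only a sketch under stated assumptions: the function names and the 20 000 SAPS figure in the usage note are hypothetical, the ratios are the rule-of-thumb values listed above, and the whole point of the preceding steps is to replace them with a ratio measured on your own system.

```python
# I/O operations per second generated per SAPS (rule-of-thumb values from the list above).
IO_PER_SAPS = {
    "ERP": 0.4,
    "PI": 0.4,
    "BW": 0.6,
    "LOW_IO": 0.2,  # components with low I/O requirements, e.g. EP, CRM, SolMan
}


def estimated_iops(saps, component="ERP", growth=1.0):
    """Estimate IOPS for a given SAPS consumption, scaled by an expected growth factor."""
    return saps * IO_PER_SAPS[component] * growth


def measured_io_per_saps(measured_iops, saps_in_use):
    """Derive your own I/O-per-SAPS ratio from measured values (e.g. from nmon data)."""
    if saps_in_use <= 0:
        raise ValueError("saps_in_use must be positive")
    return measured_iops / saps_in_use
```

For example, `estimated_iops(20_000, "ERP", growth=2.0)` gives roughly 16 000 IOPS for a hypothetical ERP system that consumes 20 000 SAPS today and is expected to double its load. If the ratio from `measured_io_per_saps` differs noticeably from 0.4, use the measured one.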

OK, those are my 2 cents so far, for free … I hope this helps you head in the right direction.

Regards

Stefan

Former Member · ‎01-06-2014

Hello,

Sorry for the delay. I had a longer break :-).

Sergo, during high workload the load average is between 40 and 50. On a normal day (not month-end etc.) the maximum is around 30. But this value is a result of CPUs, storage devices etc. (I assume you know how it is calculated). We have had problems, but they were caused once by bad coding (SQLs) and once by a bug in the storage firmware (the load average was then around 70 to 80 and system performance was not acceptable). Regarding CPU: the current 12 CPU cores are still enough (with HT that gives 24 logical cores, but HT shouldn't be counted directly as computing power, and it can sometimes decrease single-thread performance instead of improving it). However, we are going to replace the hosts because the business is growing and because, during the problematic situations, the system behaved like a snowball, with waits for I/O piling up on top of normal CPU processing. I'm aware that insufficient IOPS and throughput can stall any number of CPUs, but the goal is a balanced system [I/O <-> CPU] with reserve for peaks.
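
Since the Linux load average counts tasks in uninterruptible sleep (typically waiting for I/O) as well as runnable tasks, a value of 40 to 50 on 12 cores does not by itself say whether CPU or storage is the constraint. The following Python sketch shows one rough way to separate the two at a given moment. The function names are illustrative only; it assumes the `procs_running` and `procs_blocked` lines that Linux exposes in /proc/stat.

```python
def run_queue_breakdown(proc_stat_text):
    """Return (runnable tasks, tasks blocked on I/O) parsed from /proc/stat text."""
    values = {}
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("procs_running", "procs_blocked"):
            values[parts[0]] = int(parts[1])
    return values.get("procs_running", 0), values.get("procs_blocked", 0)


def load_per_core(load_average, cores):
    """Normalise a load average by the number of physical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores
```

With the figures above, `load_per_core(45, 12)` is 3.75. Sampling `run_queue_breakdown` on the text of /proc/stat every few seconds during a peak would indicate how much of that is tasks waiting for I/O as opposed to tasks waiting for a CPU.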

Sergo, Fidel, here is some data for the last 7 days. I can't give exact data here, so below are some rounded statistics (please keep in mind that this period included a lot of days off; there were only a little more than 8 million dialog steps):

EVENT_NAME	AVG_MS	PERCENT
Network	less than 0,5	around 42%
db file sequential read	less than 3	around 41%
CPU		around 13%
enq: TX - row lock contention	around 60	around 1%
read by other session	around 4	less than 1%

Fabian, thanks for the information about SP3. We are going ahead with the upgrade to SLES 11 SP3. The Linux mechanisms (at least judging by the newer kernel and multipathing implementation) should be more efficient. Regarding the reason for my question: apart from problems caused by bugs (SQL code, storage firmware bug) we haven't had performance issues. We expect the load to double during the next year; that's why I'm concerned about performance. We are currently in the process of replacing hosts and storage. During the replacement we would like to remove possible bottlenecks in the software layer, and we are currently considering Oracle ASM (although it also has some cons). Does anyone know of any bigger SAP installations on ASM (with or without RAC) on Linux (not on Exadata)? If yes, please post pros and cons.

The second architecture point is that I would avoid going to RAC, as the system is not that big yet. Has anyone used ASM with SLES HA?

The third point is that I would like to move to VMware, as I have other systems with around 1 million dialog steps per day running on vSphere 5 without issues. Please let me know if you have experience on VMware with ERP systems that handle 2 million or more dialog steps per day (and on average more than 15 000 user calls per second in Oracle).

Thanks in advance.

Marek

fidel_vales · ‎11-25-2013

In addition to what Sergo asked, I'd like to know if you have a performance problem or you are tuning for the sake of it?

It would also be very interesting to know not only the top 5 wait events, but also the wait for CPU. Go to SAP Note 1438410 and execute the script TimedEvents_TopTimedEvents in the SQL Editor (ST04).

Former Member · ‎11-24-2013

Hello, we have a configuration similar to yours, but currently only on BW servers (ERP will follow soon).
Can you show your top 5 events from AWR during a period of high workload? Can you provide information from ST03N: what is your usual average dialog time per day on a high-load day? Do you have enough CPU on the database server? Can you post the info from ST06 (from the DB server) on the 15-minute "load average" during high workload?
Regards, Sergo.

