on 11-22-2013 2:23 AM
Hello,
I've got one SAP system on Oracle: a single Oracle instance without ASM. It averages 17,000 user calls per second (average over a two-month period) and does around 2 million dialog steps per day, plus 1.5 million RFC+UPD+BGD steps combined.
We follow SAP's recommendations for DB parameters and for tuning expensive SQL statements. In the top 20 selects sorted by DB time, no select reads more than 5 blocks per row returned. I'm aware that this is not the only criterion, but it shows that the system is being tuned from the expensive-SQL perspective: not that everything is tuned, but all of the top resource-consuming SQLs from the cursor cache look fine. I'm also aware that expensive SQLs very often cause unnecessary load.
Our storage vendor ran checks and claims there are no hot spots on the storage and that it has free resources to handle twice the current load (IOPS). I would like to know where this configuration could have bottlenecks. Sometimes (especially when DB load is higher than usual) I see queue and wait times on the dm* devices. The configuration looks like this:
Linux SLES 11 SP1. LUNs configured from OS to storage:
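For readers wondering where the queue and wait figures on the dm* devices come from: tools like iostat derive them from deltas of the /proc/diskstats counters. A minimal sketch of that arithmetic, with hypothetical counter values for one dm device sampled 10 seconds apart:

```shell
# Hypothetical /proc/diskstats counters for dm-0, sampled 10 s apart:
# reads completed, ms spent reading, writes completed, ms spent writing
r1=100000; rt1=250000; w1=50000;  wt1=90000
r2=101500; rt2=256000; w2=50600;  wt2=93600

ios=$(( (r2 - r1) + (w2 - w1) ))        # I/Os completed in the window
ticks=$(( (rt2 - rt1) + (wt2 - wt1) ))  # ms spent servicing them
echo "IOPS:           $(( ios / 10 ))"
echo "avg await (ms): $(( ticks / ios ))"
```

In practice `iostat -x 10` on the dm-* devices gives the same figures (r/s + w/s, await, avgqu-sz) without the manual arithmetic; sustained await well above the storage latency spec, or a growing queue size, points at queuing below the filesystem layer.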
Please let me know if you have experience with a similar load on an Oracle-on-Linux architecture: what configuration do you have, and what would you improve in the above? (Please give hints based on your own experience.)
Thanks in advance,
Marek
Just a question: are you still on SLES 11 SP1? Is there any reason not to update to SP2, or to the even newer, current SLES 11 SP3?
Besides that: do you really have performance issues, or are you trying to tune the system (which would call for a different approach)?
A good option you already have is to switch off barriers for ext3 on SLES 11 SP1, as they really can slow down I/O.
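For reference, disabling barriers is a mount option; a sketch assuming a hypothetical device and mount point. Trade-off to keep in mind: barriers protect the journal on power loss, so only disable them with a battery-backed write cache:

```shell
# /etc/fstab entry for the data filesystem (hypothetical device/mount point):
#   /dev/mapper/vg_ora-lv_data  /oracle/data  ext3  barrier=0,noatime  1 2
#
# Apply to a mounted filesystem without reboot:
#   mount -o remount,barrier=0 /oracle/data
#
# Verify the active mount options:
#   grep /oracle/data /proc/mounts
```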
Hi Marek,
I just spotted this thread after your reply yesterday. It seems like a lot of things are mixed up here without considering the right metrics and requirements.
> in top 20 selects sorted by DB time … but it is to show that system is being tuned
Oracle DB Time <> DB Response Time (from the client's perspective). So the described approach is basically used to reduce CPU or I/O load on the database server, not for "performance tuning" (e.g. PX can produce much more load, but may be faster as a result). I have written a blog post about this topic and its misconceptions here:
> 17 000 user calls per second and daily around 2 million dialog steps + 1,5 million of RFC+UPD+BGD steps together.
You mentioned I/O queues and possible bottlenecks, but the key metrics you quote have no relation to them (e.g. one database FETCH call can issue 50,000 I/Os or no I/O at all; think also about DB backups, etc.). So you cannot correlate user calls per second to a specific CPU or I/O load.
> During high workload load average is between 40 and 50 … Regarding CPU - current 12 CPU cores are still enough - but we are going to replace hosts due to growing business and as during problematic situation system has been like a snow ball with waiting for I/O and normal CPU processing. Except problems due to bugs (SQL code, storage firmware bug) we haven't performance issues. We expect to have load twice higher during next year - thats the reason why I'm aware of performance.
So you have a specific issue like "a snow ball with waiting for I/O and normal CPU processing". Such issues can be profiled precisely and drilled down with a tool called "perf" in newer kernel releases. I would try to find the root cause first, before switching anything based on assumptions (as far as I understand your situation).
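A sketch of how such a snowball could be drilled down with perf (these are standard perf subcommands; the sampling duration and report options are only examples, and perf must be installed and run as root):

```shell
# System-wide sampling for 30 s, with call graphs:
#   perf record -a -g -- sleep 30
#   perf report --sort comm,dso      # where did the time go, per process/library?
#
# Scheduler view: which tasks were blocked, and for how long:
#   perf sched record -- sleep 30
#   perf sched latency
```

The scheduler latency view is the interesting one for an I/O snowball: it separates tasks that are genuinely computing from tasks that spend their time waiting to be scheduled or blocked on I/O.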
> Second thing with architecture is that I would avoid going to the RAC as system is not so big yet
Absolutely. If you can handle the current workload with 12 CPUs, there is no reason to switch to RAC for scale-out. Today's x86 commodity hardware can scale up far beyond your current configuration; you have to generate a tremendous amount of load before you need a scale-out solution like RAC. That's the load perspective. There may, however, be other reasons to switch to RAC depending on business requirements (OS upgrades with no downtime, rolling upgrades, consolidation, etc.). It is also very easy to degrade performance with RAC if you do not design carefully (e.g. cache fusion with 3-way block gets in the worst case).
> Does anyone know any bigger SAP installations on ASM (w/o RAC or with RAC) on Linux (not on Exadata) - if yes please - post pros and cons.
Any bigger than what? You have provided no IOPS figures or anything comparable. Yes, I know of / have seen a lot of "large" mission-critical systems on ASM. The "pros" and "cons" depend on the infrastructure, the features used, and the internal knowledge of each team member.
> Does anyone used ASM with SLES HA ?
Yes, you can use SLES HA (Pacemaker) with Oracle Restart (ASM is part of the Grid Infrastructure stack nowadays) and handle the ASM disk group mounting in the HA script (the disk group needs to be mounted exclusively by the non-clustered ASM instance). As a consequence, each database needs its own ASM disk group(s), and you need to test the setup very carefully.
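A rough sketch of the disk group handling such an HA script has to do (the disk group name and Grid home path are hypothetical; asmcmd ships with the Grid Infrastructure home):

```shell
# On the node taking over the database (single-instance ASM, not clustered):
#   export ORACLE_HOME=/u01/app/grid ORACLE_SID=+ASM
#   asmcmd mount DATA_ERP          # mount the database's dedicated disk group
#   ... start the database resource ...
#
# On failover / stop, after the database is down:
#   asmcmd umount DATA_ERP
```

The exclusive-mount requirement is the critical part to test: the HA stack must guarantee the disk group is never mounted on two nodes at once.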
> As I have other systems with around 1 million dialog steps per day working on vSphere 5 without issues. Please let me know if you have experience with ERP systems with 2 million or more dialog steps per day (and an average of more than 15,000 user calls in Oracle per second) on VMware.
You are looking at the wrong key metrics again. The number of dialog steps has no correlation to the important factors like CPU or I/O load. You need to measure the (OS-level) load of your current system and, based on that data, decide whether it is possible. VMware has limitations, and some of them amplify existing performance issues (e.g. vCPU scheduling under over-allocation). I have built high-performance Oracle HA infrastructures on VMware (e.g. with RDM, etc.) and they run well, but only if you are not hitting specific limits or constellations.
I highly recommend that you collect the required data first and make your decisions based on that (and not on things like database calls or dialog steps):
All of the following rule-of-thumb values are based on Unicode systems:
… or …
OK, those are my 2 cents so far, for free. I hope this helps you head in the right direction.
Regards
Stefan
Hello Stefan,
thank you for all the information. It is helpful, mostly for people without broader experience in sizing, but it is off-topic relative to the question/intention of my post. I didn't ask how to check things and get the results; I asked whether anyone has a similar architecture in use for an ERP system with a lot of "traffic".
Sometimes you take things out of context, like the topic of performance problems (please re-read the previous posts if you want to dig into it), and I see that your post is more about promoting yourself and your knowledge (which I appreciate) than about directly sharing useful real-life examples. Instead of writing provocative questions like "Any bigger than what?", which don't add any value, please provide an example architecture from your experience. I agree that the number of dialog steps is not always the best indicator, but if it is not enough (and the system has a different characteristic compared to commonly used ERP functionalities), please include in your example architecture more details about what I asked.
In the end I'm not giving even one cent (the project has been postponed by the business for a year, so I can't give feedback from the migration yet, but VMware + a single Oracle instance with ASM will probably be the target).
Best Regards,
Marek
P.S., for free: avoid telling somebody that they are using something wrong (like "wrong metrics") before understanding their needs; people don't like being told they did something wrong ;-). For sizing, the metrics I gave would indeed be wrong; for an overview question about real-life architectures, they were fine for me.
Hello,
sorry for the delay, I had a longer break :-).
Sergo: during high workload the load average is between 40 and 50. On a normal day (not month-end, etc.) the maximum is around 30. But this value is a product of CPUs, storage devices, etc. (I assume you know how it is calculated). We have had problems, but they were connected only once with bad coding (SQLs) and once with a bug in the storage firmware (load average was then around 70-80 and system performance was not acceptable). Regarding CPU: the current 12 CPU cores are still enough (with HT it shows 24, but HT shouldn't be counted directly as computing power, and can sometimes decrease single-thread performance instead of improving it). However, we are going to replace the hosts due to growing business, and because during the problematic situations the system snowballed between waiting for I/O and normal CPU processing. I'm aware that insufficient IOPS and throughput can kill as many CPUs as there are, but the goal is a balanced system (I/O <-> CPU) with reserve for peaks.
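For readers following along: load average is easiest to judge per core, and on Linux it also counts tasks in uninterruptible sleep (D state, typically waiting on I/O), which is why an I/O problem inflates it even when CPUs are partly idle. A tiny sketch with the numbers above:

```shell
cores=12    # physical cores; HT threads deliberately excluded
load=45     # observed peak load average
per_core=$(( load * 100 / cores ))   # percent of one core's capacity, per core
echo "load per core: ${per_core}%"   # > 100% means more runnable/blocked
                                     # tasks than cores to run them
```

A value well above 100% per core with low CPU utilization is the signature of tasks piling up in D state, i.e. an I/O bottleneck rather than a CPU one.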
Sergo, Fidel: some data for the last 7 days. I can't give exact data here; below are rounded statistics (please keep in mind this period included a lot of days off, with only a little more than 8 million dialog steps):
EVENT_NAME | AVG_MS | PERCENT
Network | less than 0.5 | around 42%
db file sequential read | less than 3 | around 41%
CPU | | around 13%
enq: TX - row lock contention | around 60 | around 1%
read by other session | around 4 | less than 1%
Fabian: thanks for the information about SP3. We are going ahead with the upgrade to SLES 11 SP3; the Linux mechanisms (at least looking at the newer kernel and multipathing implementation) should be more efficient. Regarding the reason for my question: apart from problems caused by bugs (SQL code, storage firmware), we haven't had performance issues. We expect the load to be twice as high next year; that's why I'm concerned about performance. We are currently in the process of replacing hosts and storage. During the exchange, however, we would like to remove possible bottlenecks in the software layer, and we are currently considering Oracle ASM (though it also has some cons). Does anyone know of bigger SAP installations on ASM (with or without RAC) on Linux (not on Exadata)? If yes, please post the pros and cons.
The second architecture question: I would avoid going to RAC, as the system is not that big yet. Has anyone used ASM with SLES HA?
Third, I would go to VMware, as I have other systems with around 1 million dialog steps per day working on vSphere 5 without issues. Please let me know if you have experience on VMware with ERP systems doing 2 million or more dialog steps per day (and an average of more than 15,000 user calls in Oracle per second).
Thanks in advance.
Marek
In addition to what Sergo asked, I'd like to know: do you have a performance problem, or are you tuning for the sake of it?
It would also be very interesting to know not only the top 5 wait events, but also the wait for CPU. Go to SAP Note 1438410 and execute the script TimedEvents_TopTimedEvents in the SQL editor (ST04).
Hello, we have a configuration similar to yours, but currently only on BW servers (it will be on ERP soon).
Can you show your top 5 events from AWR during a high-workload period? Can you provide information from ST03N: what is your usual average dialog response time during a high-load day? Do you have enough CPU on the database server? Can you post info from ST06 (from the DB server) on the 15-minute load average during high workload?
Regards, Sergo.