on 10-20-2015 4:38 PM
We're running out of procedure cache, and from what I can tell this happens as MEMC_SCOMPILE_1, shown in the monProcedureCacheMemoryUsage table, grows to an extremely high amount (over 6GB).
I'm looking at monProcedureCacheMemoryUsage and monProcedureCacheModuleUsage, and I was expecting the per-module totals to match each other. But they don't.
I run this (the #modusage step was omitted from my original paste - it's the analogous query against monProcedureCacheModuleUsage):
select ModuleID, ModuleName, sum(Active) Active
into #modusage
from master..monProcedureCacheModuleUsage
group by ModuleID, ModuleName

select ModuleID, sum(Active) Active
into #memusage
from master..monProcedureCacheMemoryUsage
group by ModuleID

select mod.ModuleID, mod.ModuleName, mod.Active ModActive, mem.Active MemActive
from #modusage mod, #memusage mem
where mod.ModuleID = mem.ModuleID
order by mod.ModuleID
and get
ModuleID ModuleName ModActive MemActive
----------- ------------------------------ ----------- -----------
1 Parser 6 6
2 Utilities 9 16
4 Optimizer 0 0
5 Execution 3512 2097003
6 Access 0 0
7 Backup 0 0
8 Recovery 0 0
9 Replication 55 55
10 Procedural Objects 2022359 7106
11 Sort 5 5
12 HK GC 20 20
14 BLOB Management 0 0
15 Partition Conditions 737 830
16 Pdes Local HashTab 9 9
17 Statement Cache 80192 6184
18 CIS 4 4
19 Frame Management 8 8
21 Network 0 0
22 Procmem Control 5850 5850
23 Data change 633 633
26 Multiple Temporary Database 4 4
Many modules do match, but some (especially the large ones) don't. When we run out of procedure cache, Module 10 in module usage shows over 3,000,000 and Module 5 in memory usage shows over 3,000,000.
Should the totals in the 2 tables match ?
If not why not ?
Do other people get the same results ?
I've seen some bug fixes in 15.7 SP136 - is this a symptom of those issues?
Thanks
Mike
Thanks again for this ...
I've read the posting and this one Spinlocks and CPU usage in SAP ASE - SAP ASE - SCN Wiki and the doc http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/208c05a1-d739-3210-f0ae-f25f343c0...
which has helped me to understand the issue ... but there's something I'm not very clear on regarding the Engine Local Cache (ELC).
As I understand it
The ELC is allocated 50% (by default) of the free procedure cache, but what happens when an ELC has allocated all of its space to procs? Is the ELC initially allocated at start-up?
On our server, we have 12GB of procedure cache and 40 engines.
Am I correct in assuming that each ELC is initially allocated (12GB * 0.5) / 40 = 150MB?
What happens when one or more of the ELCs uses up its initial allocation?
Does it then grab another block of memory based on 50% of the free space, i.e. (6GB * 0.5) / 40 = 75MB,
and when that is used up, grab another 50% of the free space, depending on how much has already been allocated?
Or does the ELC get its allocations from the global cache in smaller chunks, i.e. 2K pages?
From what I can see happening on our server, once we reach 50% of procedure cache used we get huge spinlock contention on the global procedure cache, which leads me to believe the ELC is being refilled frequently in small chunks. If so, why wouldn't I just allocate > 70% of the procedure cache to the ELCs?
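The arithmetic in the question can be sanity-checked with a quick sketch. The 50% default, 40 engines and 12GB figures are from this post; whether ASE actually refills the ELC this way is exactly what's being asked, so this is only the "50% of remaining free space" hypothesis worked through:

```python
# Toy arithmetic for the per-engine ELC sizing question above.
# Assumptions (from the post, not from SAP docs): 12 GB procedure cache,
# default 'engine local cache percent' of 50, 40 engines.
proc_cache_kb = 12 * 1024 * 1024   # 12 GB in KB
elc_percent = 50
engines = 40

per_engine_kb = proc_cache_kb * elc_percent / 100 / engines
print(f"initial per-engine ELC ~ {per_engine_kb / 1024:.1f} MB")  # ~153.6 MB

# If a refill really took 50% of the *remaining* free space each time,
# the second grab per engine would be half the first:
second_grab_kb = (proc_cache_kb / 2) * elc_percent / 100 / engines
print(f"hypothetical second grab ~ {second_grab_kb / 1024:.1f} MB")  # ~76.8 MB
```

That reproduces the ~150MB and ~75MB figures quoted above, but under the halving hypothesis the grabs shrink quickly, which would fit with seeing frequent small refills later on.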
I'm looking at trace flags 753 and 757 but can't see what 757 actually does. Does anyone know the details?
Thanks
Mike
After a lot of monitoring and reading I think we've found the issue (or at the very least we've found some solutions/workarounds).
Here are some useful links
https://scn.sap.com/thread/3746247
http://www.ndm.net/bi/pdf/Managing-DBMS-Workloads-v1.0-WP.pdf
http://wiki.scn.sap.com/wiki/display/SYBASE/Spinlocks+and+CPU+usage+in+SAP+ASE
http://scn.sap.com/thread/3658742
http://mattzhang-tech.blogspot.co.uk/2013/01/stored-procedure-processing-in-sybase.html
I agree with Simon's comment that the ELC was a "little rushed" and doesn't help in every case.
We've decided on these actions ... if anyone thinks this is wrong, let me know - thanks.
Assuming monProcedureCacheMemoryUsage and monProcedureCacheModuleUsage are reporting correct values, we're not actually running short of procedure cache, but we are potentially ending up with so much fragmentation that requests for 4K or more of contiguous procedure cache have to go to the global procedure cache, and when many processes do this at once the spinlock contention is huge. In our case we end up with over 90% fragmentation.
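As a toy illustration (not ASE internals) of why high fragmentation hurts even with plenty of free space - a contiguous request can fail no matter how much total free space there is, if the free space is all in small runs:

```python
# Toy model: free procedure cache as a list of contiguous free runs,
# measured in 2 KB pages. Plenty of total free space, but no single
# run big enough for a 4 KB (2-page) contiguous request.
free_runs = [1] * 5000          # 5000 isolated single free pages
request_pages = 2               # a 4 KB contiguous allocation

total_free_pages = sum(free_runs)
can_satisfy = any(run >= request_pages for run in free_runs)
frag_pct = 100 * (1 - max(free_runs) / total_free_pages)

# lots of free pages, yet the contiguous request fails
print(total_free_pages, can_satisfy, round(frag_pct, 2))
```

In this model the allocator's only options are to fall back to a global search (spinlock contention) or evict something to create a larger run, which matches the behaviour described above.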
So far we've worked around this by clearing out the procedure cache (dbcc proc(free_unused)) - but this seems heavy-handed to me and is surely only a workaround.
We also tried dbcc proc(flush_elc) and this improved the responsiveness of the server, but we don't know why that would be the case. We really need to understand in detail how the ELC and procedure cache interact.
We're going to increase the amount of memory allocated to the procedure cache to allow for fragmentation - although 16GB does seem rather high
(years ago a server with 32GB of RAM was considered large).
Additionally, we're going to increase 'engine local cache percent' to 75% to give more allocation to the ELCs, which we hope will also reduce contention.
According to the manual the ELC should be allocated using the calculation: proc cache size * 'engine local cache percent' / engines.
That should give us about 150,000 pages, but the log says "Proc header memory allocated 18891 pages for engine 1 local cache".
We haven't heard back why these figures are inconsistent.
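Working the manual's formula through with our intended settings makes the discrepancy concrete. The 16GB cache, 75% ELC percent, 40 engines and 2K page size are assumptions taken from this thread, not verified ASE internals:

```python
# Sanity check of the manual's formula quoted above, using figures
# from this thread (16 GB procedure cache, 'engine local cache
# percent' = 75, 40 engines, 2 KB pages).
proc_cache_pages = 16 * 1024 * 1024 // 2      # 16 GB expressed in 2 KB pages
elc_percent = 75
engines = 40

expected_pages = proc_cache_pages * elc_percent // 100 // engines
print(expected_pages)        # 157286, i.e. roughly the 150,000 quoted

logged_pages = 18891         # from the errorlog line quoted above
print(f"log reports only {100 * logged_pages / expected_pages:.0f}% of that")
```

So the errorlog figure is around an eighth of what the documented formula predicts, which is the inconsistency being asked about.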
I think a better solution is trace flag 753, which (if I've read this correctly) will only allocate 2K pages. This will reduce fragmentation and means a page will always be available. If 16K is required, then 8 requests are needed, so in theory it will be slower. In effect we're accepting lower average performance for much less contention (and since our machine has been unresponsive at times, that's OK with us).
We're also looking at trace flag 757, which might be more aggressive in its procedure cache eviction - but that seems over the top when only half the cache is full.
Thanks again Simon.
Hi Mike,
I'll avoid going into the nitty gritty on this one but I'm glad you have a potential solution 🙂
Lastly, your proc cache state is definitely a sure sign that 753 may help you. If you have tons of compilation going on (we force 65% of our SQL), you may need to take care with it. The good thing is it's entirely dynamic: you can enable it using dbcc traceon(753) and it will immediately stop > 2KB allocations for all spids. If you have a problem, just turn it off again.
Be aware of at least one gotcha that is apparent even when running with 753. Large ad-hoc SQL statements (16KB and above) appear to seek out contiguous chunks of proc cache in order to cache the SQL text; this can genuinely bring processes to a standstill and cause ELCs to flush ad infinitum. You'll also see proc bufs being flushed out of the cache for fun despite you having tons of free space (but all in small chunks, of course). We ran into this on a recent upgrade. We are now running trace flag 7790 (which stops any SQL statement over 16KB being cached) on all our 15.7 production servers.
Cheers,
Simon
Thanks again... agreed, the "rushed" bit was about large allocations - didn't mean to misquote you.
Interesting read on trace flag 7790. I'll have to try this as well (although I'm meant to be spending my time coding not dealing with production issues!!)
I'm also looking at the output from procbuf and this shows some oddities.
Have you seen anything like this...
pbname='Build' pbprocnum=1 pbsharecnt=0 pboldplan=(nil)
<snip>
Total # of PROCHDRs used : 6
Total # of pages used : 84
Total # of bytes used : 160297
Total # of lost bytes for alignment : 15364359
Total # of pages lost for alignment : 7502
Total percentage of memory lost : 9584
More bytes lost than used - seems odd to me?
There are some fixes/improvements to procedure cache usage in SP135 - have you tried it yet ?
I wouldn't start trying to make sense of that output; I can't tell you what counter it is reading. Maybe look at what the other procbufs say - is that one sticking out as odd? Is there something funky with the 'Build' proc (assuming it's a proc, of course)?
Haven't looked at SP135 really yet, what CRs are you referring to?
Simon,
We tried traceflag 753 and we still get the problem.
Over a 30s period we see
SpinlockName SpinlockSlotID Grabs Spins Waits Contention
-------------------------------------- ------------------ --------- ---------------- ------- --------------
Resource->rproccache_spin 1401 14129 1356281287 13822 97.82
Total amount of procedure cache used is
master..monProcedureCacheModuleUsage reports : 5070996 pages
master..monProcedureCacheMemoryUsage reports : 5070950 pages
master..monCachedProcedures reports : 9,638,510 KB (9.6GB)
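As a quick cross-check that those three figures are at least consistent with each other (assuming the two MDA tables report 2K pages, as used elsewhere in this thread):

```python
# Cross-check of the three cache-usage figures above: the two MDA
# tables report counts in (assumed) 2 KB pages, monCachedProcedures
# reports KB.
module_usage_pages = 5070996
memory_usage_pages = 5070950
cached_procs_kb = 9638510

module_usage_kb = module_usage_pages * 2
print(module_usage_kb)                               # 10141992 KB
print(f"{cached_procs_kb / module_usage_kb:.2f}")    # ~0.95
```

monCachedProcedures accounts for roughly 95% of what the module-usage table reports, so the three views are in the same ballpark rather than wildly contradicting each other.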
We'll try traceflag 7790 when it happens again.
Thanks
Mike
That is a ridiculously huge amount of spins per wait and crippling contention. Are you also seeing large amounts of re-use during these periods?
Can you pull out dbcc stacktrace output for all spids when in this state? Maybe run the script 2 or 3 times. Note it 'should' be safe to run on production, but run it at your own risk - I can't guarantee it won't cause some unexplained issue specific to your environment.
BTW, have you logged an incident with SAP?
dbcc traceon (3604)
go
select SPID spid into #t from master..monProcess
create unique index ix on #t(spid)
go
set nocount on
declare @spid int
declare @spidstr varchar(12)
declare @rowcount int
while 1=1
begin
select top 1 @spid=spid from #t (index ix)
select @rowcount=@@rowcount
if @rowcount<>1
break
-- print placeholder arguments must be character types in ASE, so convert first
select @spidstr = convert(varchar(12), @spid)
print 'stack trace and sql text for spid %1!', @spidstr
dbcc stacktrace (@spid)
dbcc sqltext (@spid)
print ''
delete from #t where spid=@spid
end
go
drop table #t
go
Check the Reuse_cnt for 'procedure cache' in sp_monitorconfig 'all'.
Do you see this figure increase during these periods? It's a measure of how many plans are being forced out of cache - an indication that maybe something is looking for contiguous space it can't find. I assume you have spare proc cache being reported and it isn't just full?
That implies your max used is only about 80% which isn't brilliant but I wouldn't expect it to be as bad as your spinlock output shows. Is it at the HWM when you have the issue?
I'd suggest running that script I posted during the times when it happens and also look at the 'MemUsageKB' field in monProcessActivity to see if it gives any clues as to who may be using the proc cache.
It is possible that the large statement scenario (resolved with TF 7790) is happening, but the stack trace outputs should confirm. You should really pursue this one through tech support if it is causing you business impact.
Thanks a lot for your help.
We have a case open, so hopefully we'll have more information from them. It's not actually affecting production at the moment as we're throttling the load - but we'd like to run at full capacity, and from what we can see, capacity is a function of time since reboot.
We've been able to induce this on a test env by running 100s of procs simultaneously.
[On production we managed to see the issue with ~100 procs]
The one consistent factor is that it always happens at about 80% usage of the procedure cache.
Running your SQL shows the top function of the stack trace to be either upyield or upsleepgeneric.
Not sure what I should be looking for, but I'll have a look through the hundreds of spids.
We'll try TF 7790; I suspect it will help but mostly by delaying the time for the problem to occur.
We're going to increase the amount of procedure cache as well.
Thanks again for your help - it's most appreciated.
Mike
Now that we've fixed our issue with procedure cache contention, I thought I'd share some findings.
I repeatedly ran a proc that uses a temp table, in a loop across many parallel connections (150,000 proc calls in total).
Initial time for each run : 1m 5s
After the procedure cache was over 50% full, time to run : 3m 30s
Turned on trace flag 753 and re-ran : 1m 5s
Also tried setting 'optimize temp table resolution' on, time taken : 27s
In this particular case the proc uses a temp table but is a very simple proc.
I haven't tried different procs or different complexity and length.
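For reference, the timings above expressed as relative factors (simple arithmetic on the figures in this post):

```python
# Relative factors from the benchmark runs above.
baseline_s = 65          # 1m 5s, cold procedure cache
degraded_s = 210         # 3m 30s, cache over 50% full
tf753_s = 65             # after dbcc traceon(753)
temp_opt_s = 27          # with 'optimize temp table resolution' on

print(f"degraded run: {degraded_s / baseline_s:.1f}x slower")        # 3.2x
print(f"temp table opt: {baseline_s / temp_opt_s:.1f}x faster")      # 2.4x
```

So for this particular proc, TF 753 fully recovered the baseline and the temp-table option beat it outright.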
Hope this helps if anyone else has the same problem.