on 11-20-2014 12:01 PM
Hi,
We are currently experiencing an issue where certain background jobs are scheduled/triggered but then get stuck in Ready status and do not process when their start time arrives.
When they get stuck in Ready status, we can see in SM37 that the executing server is always the application server.
Our central server is a Unix system and our application server(s) are Windows, both running against the Oracle DB hosted on the central server.
The jobs that are getting stuck are both technical (RDDIMPDP, SAP_COLLECTOR_PERFMON, etc.) and user-triggered jobs.
I've noticed that we sometimes receive messages stating 'unable to connect to oracle', which I'm assuming is related, as these seem to have picked up with the increase in stuck jobs.
In the work process logs, these messages keep appearing:
dblink[db_reconnect]: { new_reconnect_message=1
dbcon[db_con_reconnect]: { reco_trials=3, reco_sleep_time=5
00: name=R/3, con_id=000000000, state=INACTIVE , tx=NO , bc=NO , hc=NO , perm=YES, reco=NO , info=NO ,
timeout=000, con_max=255, con_opt=255, occ=NO , prog=
dbcon[db_con_reconnect]: } rc=0
***LOG BV4=> reconnect state is set for the work process [dblink 1999]
***LOG BYY=> work process left reconnect status [dblink 2000]
dblink[db_reconnect]: } rc=0
ThHdlReconnect: reconnect o.k.
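As an aside, the reco_trials=3 / reco_sleep_time=5 values in that trace describe a simple retry loop: the work process attempts to reconnect a limited number of times, sleeping a few seconds between attempts. A minimal Python sketch of that pattern (the connect function here is a hypothetical stand-in for the real DB layer, not SAP code):

```python
import time

def reconnect(connect_fn, reco_trials=3, reco_sleep_time=5, sleep=time.sleep):
    """Retry-with-sleep reconnect loop, mirroring the reco_trials /
    reco_sleep_time values seen in the work process trace."""
    last_error = None
    for attempt in range(1, reco_trials + 1):
        try:
            return connect_fn()          # success: hand back the connection
        except ConnectionError as exc:   # failure: remember the error, retry
            last_error = exc
            if attempt < reco_trials:
                sleep(reco_sleep_time)
    raise last_error                     # all trials exhausted
```

When the loop succeeds you get the "reconnect o.k." outcome above; when all trials fail, the work process stays in reconnect status.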
Does anyone have an idea of where to look to resolve this?
We have looked at SAP Note 1902517 and tried various values for the parameters SQLNET.SEND_TIMEOUT and SQLNET.RECV_TIMEOUT, but they have not resolved this. Since upgrading our kernel and support packs, it seems to have worsened in our dev environment too (where these parameters were changed).
The only way to get round this is to go into SM37 and start the job off manually, but it takes too much time to do this for every stuck job.
Let me know if you want more information and thanks in advance
Liam
Hi Liam
I've just experienced the same issue in our production system today, although we are using SQL Server. I noticed that none of the batch jobs in Ready status could be changed, rescheduled or cancelled. They didn't even have a work process that I could go to the server and kill.
There were available background processes on all 4 app servers, so there was no WP contention.
What I did was cut all of the Background work processes in RZ04 CCMS: Maintain Work Process Distribution. After activating the specific "Mode", I was able to SM37, select the background job and run the "Job -> Check Status" which allowed me to start the process and cancel it.
I then added the background processes back into the mode and was able to run background jobs on that server again.
For every batch job that was in that Ready state, there were event log entries stating a database error: TemSe for table TST01 key (or similar), a failure to read the status, and 'Failed to enter message 00& (application area 55)' in the job log.
My theory is that there was a communication issue between this app server and the message server. Thankfully I didn't have to reboot the production servers during the business day, but we are planning to reboot at the weekend in the maintenance period.
Coincidentally, MS server patches were deployed the previous weekend.
Hope this helps
Ken
Hi Ken,
Can you clarify what you did in RZ04 exactly? I'm not very familiar with how that transaction works, to be honest. When I go into RZ04 I only have the DUMMY mode there.
By 'cut all of the Background work processes', do you basically mean allocate 0 to them in that DUMMY mode?
Annoyingly/ironically, we have not had any Ready job issues in production over the last week, but it is still appearing on our dev server. We are working with SAP too to try and resolve it.
I've found a note which is linked to the ORA-12577 issue we are experiencing, which basically says to upgrade the kernel to a certain patch level or above. Ours is a few levels below, so I'm going to upgrade dev tonight and see how it gets on after that.
We're also having issues with certain SAP screens not being able to be read on the dev app server, which is chucking out the ORA-12577 message too and kicking users off. Looking on SAP, they've got a program to repair dynpros after an upgrade, so I'm going to give that a go too.
Regards,
Liam
Yes, basically I'm changing the operation mode to have 0 Background processes.
So here's how I did it. (FYI, we do have a couple of different Operation modes created, but it shouldn't matter)
RZ04 - double clicked any of the Operation Modes (i.e. Dummy)
Then under the specific server, double click on the Operation Mode again and the CCMS: Maintain Work Process Distribution pop-up appears.
Click in the Background "Number of work processes" box, then click on the " - " button underneath the Total, until the Background has no more or '0'.
Then click the Save button, then the Green arrow back and Save the Productive instance data.
Now go to transaction RZ03, Click on the Choose operation mode button, Select the operation mode (i.e. Dummy) and click on the Choose button.
Click on the server you want to change and from the Menu bar select Control -> Switch operation mode -> selected servers.
Click on the Yes button to Switch the selected server(s) to the Operation mode that you'd changed to have 0 background processes.
I found that now I was able to go to SM37, select the Jobs in the "Ready" state and follow the menu path Job -> Check Status.
On our system we have 4 app servers so we could get the jobs moved to other servers or stop them.
Don't forget that once you've got the Ready status jobs cleared up, you will want to put the background processes back onto the operation mode that you'd taken them off of.
We've had it happen twice now and rebooted the servers after the second time this happened. And they always say "Upgrade the kernel". That may or may not work, but it couldn't hurt.
I don't think it's an Oracle type of problem, because we're running SQL Server 2005 on our systems. This is the first time that I've ever heard of something like this happening. We had implemented some Microsoft patches on the weekend, so I'm figuring it was due to the MS patches, and/or the sequence in which our cluster servers were restarted.
Your screen problem sounds like a kernel or OS problem??? But I don't know, probably not related.
Like I always say "If it's not one thing it's '2' "
Ken
Hi Liam,
When your jobs are getting hung, run transaction SARFC and paste the output.
If it shows that few resources are available, then you need to tune the SAP parameters.
With Regards
Ashutosh Chaturvedi
Hi Ashutosh,
From what I've seen with our Basis jobs, it doesn't seem to be every time (for example, a recurring 10-minute job doesn't get stuck every 10 minutes).
For user-triggered jobs, I can't really say, as I'm not sure how often they are run. But I do not think it is every time or at a particular time.
Regards,
Liam
Hi,
Do you mean how many app servers, or how many dialog processes?
In dev and QA, just one app server each.
In prod we have 3 app servers.
There are 10 dialog processes in each server for dev and QA, and 20 dialog processes in each server for prod.
Please let me know if you want different information.
Regards,
Liam
When the jobs get stuck in Ready status, check the available and used background processes.
It could be that all background processes are occupied and no WP is available to execute the job.
Also check whether all application servers are used in load balancing, so that available background WPs from the other app servers can be utilised.
Hi
Do you have errors/alerts in the DB system log ?
You could check the maximum number of processes used on your DB and verify whether you get close to the limit you have defined with the instance parameter PROCESSES.
select RESOURCE_NAME, INITIAL_ALLOCATION, MAX_UTILIZATION from v$resource_limit where RESOURCE_NAME in ('processes', 'sessions');
If so, you could raise the number of processes (and the related SESSIONS parameter, sessions = 2 x processes).
alter system set "PROCESSES" = 150 scope = spfile sid='*';
alter system set "SESSIONS" = 300 scope = spfile sid='*';
I do not think that internal jobs could consume so many processes that there are none left to serve the SAP WPs.
I'm not sure either that this error could come from an exhaustion of the Oracle processes.
You could check parameter rsdb/reco_add_error_codes to verify that some Oracle errors are not trapped as DB disconnect problems (24806 - Database Reconnect: technical details and settings)
Regards
1431798 - Oracle 11.2.0: Database Parameter Settings
process = #ABAP work processes * 2 + #J2EE server processes * <max-connections> + PARALLEL_MAX_SERVERS + 40
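As a worked example of that formula (the input numbers below are hypothetical, not Liam's actual configuration):

```python
def oracle_processes(abap_wps, j2ee_servers=0, max_connections=0,
                     parallel_max_servers=0):
    """PROCESSES recommendation per the formula from SAP Note 1431798:
    #ABAP WPs * 2 + #J2EE servers * max-connections
    + PARALLEL_MAX_SERVERS + 40."""
    return (abap_wps * 2
            + j2ee_servers * max_connections
            + parallel_max_servers
            + 40)

# Hypothetical ABAP-only system: 30 work processes, no J2EE,
# PARALLEL_MAX_SERVERS = 20.
processes = oracle_processes(abap_wps=30, parallel_max_servers=20)
sessions = 2 * processes   # sessions = 2 x processes, as noted above
```

With those made-up inputs the recommendation comes out at 120 processes and 240 sessions; plug in your own counts per instance before changing the spfile.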
Hi Yves,
Thanks for the info. I am going to adjust both the processes and sessions, as ours are set to 100 and 192, whereas after doing the calculation they should be 149 and 298.
What exactly does the MAX_UTILIZATION value mean? Why is this different from the INITIAL_ALLOCATION?
E.g. currently our processes value is set to an initial 100 with a max utilization of 63,
and sessions is set to 192 initial and 73 max utilization.
Regards,
Liam
Hi Liam,
Max utilization is the highest value reached for that parameter, i.e. the maximum number of processes used.
But maybe we are not searching in the right direction...
So this happens on a system with several instances, and the time-driven jobs only succeed in starting on the central DB instance?
If they are supposed to run on one of the Windows AS instances, they get stuck.
Have a look at note 544881 - Composite SAP note: Time-driven jobs do not run
Regards
Hi Yves,
I take it that that value is controlled by the system itself then and cannot be modified by us?
Yes, that's right, the jobs are only getting stuck on the Windows App server
I will have a read of the note, but I'm not sure if it applies, as it mentions
"Jobs with an "Immediate" or "After event" start condition run without any problems."
whereas on ours, jobs with 'after event' start conditions (such as RDDIMPDP) are getting stuck.
Regards,
Liam
Hi
No, you can change it; the view only tells you the maximum number of processes that have been launched. It allows you to check whether the PROCESSES parameter is set too low. In your case it seems to be OK, as the maximum used did not reach the number of processes you defined.
Make sure that parameter rsts/enqueue/enabled is not set (1724201 - Background jobs remain in the status "ready" )
Are the failing jobs defined to run on a specific server or server group ?
Have a look at note 1057255 - Jobs remain in status 'ready', but I do not think this will help...
And to note 394677 - Jobs do not run on certain servers
The time scheduler runs periodically at short intervals on every background server. The event scheduler can start jobs only on the server on which it is running.
There is no point in several time schedulers running on the same server simultaneously. To avoid this situation, a server-specific semaphore is used.
From a technical perspective, for Basis releases lower than Release 7.10, this semaphore is an enqueue lock of the form
Table = BTCREMTCLN Argument = <SERVER NAME>
If you cannot delete this lock after the time scheduler has run due to an external error, the time scheduler no longer runs on this server, and almost no jobs can be started on this server as a result.
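The stale-lock failure mode described above can be illustrated with a toy model (the names here just label the concept from the note; this is not the real enqueue interface):

```python
# Toy model of the server-specific semaphore: the time scheduler takes an
# enqueue lock keyed by server name, and a leftover (stale) lock blocks
# every later scheduler run on that server.
locks = set()  # stands in for the enqueue lock table (BTCREMTCLN entries)

def run_time_scheduler(server, start_jobs):
    """Run the scheduler for one server, guarded by the per-server lock."""
    key = ("BTCREMTCLN", server)
    if key in locks:
        return False           # stale lock present: scheduler skips server
    locks.add(key)
    try:
        start_jobs()
    finally:
        locks.discard(key)     # normally released here; an external error
    return True                # that skips this release leaves a stale lock
```

If the release step is ever skipped (the "external error" case in the note), every subsequent run hits the `key in locks` branch and no jobs start on that server, which matches the stuck-in-Ready symptom.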
Also perform the additional test (Goto -> Additional tests) in SM65 for all your servers.
Regards
Hi,
I deactivated the autotask jobs and set rdisp/btctime to 30 yesterday. This seems to have worked somewhat, as there are only 2 Ready jobs now, and both are the job 'SAP_COLLECTOR_FOR_NONE_R3_STAT' (program RSN3_STAT_COLLECTOR).
This has entries in SM21 saying
04:20:35 BTC 011 000 SAPSYS BY M SQL error 3135 occurred; work process in reconnect status
04:20:35 BTC 011 000 SAPSYS F6 H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1
04:20:35 BTC 011 000 SAPSYS F6 H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1
04:20:35 BTC 011 000 SAPSYS F6 H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1
04:20:35 BTC 011 000 SAPSYS F6 H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1
04:20:35 BTC 011 000 SAPSYS EC F Failed to create log for job SAP_COLLECTOR_FOR_NONE_R3_STAT 03203500/
04:20:35 BTC 011 000 SAPSYS EA Y Failed to read status entry for job SAP_COLLECTOR_FOR_NONE_R3_STAT
04:20:35 BTC 011 000 SAPSYS EB C > Job SAP_COLLECTOR_FOR_NONE_R3_STAT
04:20:35 BTC 011 000 SAPSYS EC J Failed to enter message 00& (application area 55) in job log
04:20:35 BTC 011 000 SAPSYS EA Y Failed to read status entry for job SAP_COLLECTOR_FOR_NONE_R3_STAT
04:20:35 BTC 011 000 SAPSYS EB C > Job SAP_COLLECTOR_FOR_NONE_R3_STAT
04:20:35 BTC 011 000 SAPSYS F2 0 Calling program reports invalid handle for TemSe object (magic==X'NULL-ptr')
04:20:35 BTC 011 BY M SQL error 3114 occurred; work process in reconnect status
04:20:35 BTC 011 BY Y Work process has left reconnect status
There are also a number of entries in SM21 from this morning stating
00:05:48 DP Q0 N Failed to send a request to the message server
00:05:49 DIA 009 000 SAPSYS GI 0 Error calling the central lock handler
00:05:49 DIA 009 000 SAPSYS GI 2 > Unable to reach central lock handler
These appeared after the system started up again following a backup.
Any ideas what has caused these?
Regards,
Liam
Hi
You should check if there are network problems between PAS & AS (central Unix box & Windows servers)
You can try some continuous pings (ping -t) or, even better, use SAP niping.
A network issue could explain the other errors you get:
=> An NFS/SMB problem, as some jobs are not able to write their log (the TemSe errors for table TST01 key [200]JOBLGX03203500X45925); the job log should be on the globalhost share.
=> The AS instances are not able to reach the enqueue process.
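If niping is not to hand between two servers, a rough round-trip check can be scripted. This is only a sketch of the idea (host and port are placeholders, and it assumes something on the far end echoes the bytes back, as niping's server mode does); it is not a replacement for the real tool:

```python
import socket
import time

def tcp_roundtrip(host, port, payload=b"x" * 1024, timeout=5.0):
    """Send a payload to an echo endpoint and time the round trip.
    Assumes host:port echoes the bytes back; returns elapsed seconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        received = b""
        while len(received) < len(payload):
            chunk = sock.recv(65536)
            if not chunk:
                raise ConnectionError("peer closed before echoing payload")
            received += chunk
    return time.monotonic() - start
```

Running this in a loop between the central Unix host and each Windows AS would show whether round-trip times spike or connections drop around the times the jobs get stuck.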
Regards
Hi,
We seem to be experiencing different errors now in SM21 when someone runs a transaction
Database error 12577 with GET access to table REPOLOAD
> ORA-12577: Message 12577 not found; product=RDBMS;
> facility=ORA#
> Include ??? line 0000.
Runtime error "DBIF_REPO_SQL_ERROR" occurred.
Also, here are the results of niping from the app server log
---------------------------------------------------
trc file: "niping.log", trc level: 2, release: "721"
---------------------------------------------------
Sat Nov 22 11:13:01 2014
NiIInit: allocated nitab (4096 at 0000000004310080)
NiIHSBufInit: initialize hostname buffer (IPv4)
NiHLInit: alloc host buf (100 entries)
NiSrvLInit: alloc serv bufs (100 entries)
command line arg 0: niping
command line arg 1: -c
command line arg 2: -H
command line arg 3: bubbas25
command line arg 4: -S
command line arg 5: 1527
command line arg 6: -B
command line arg 7: 131072
command line arg 8: -L
command line arg 9: 20
command line arg 10: -D
command line arg 11: 500
command line arg 12: -V
command line arg 13: 2
command line arg 14: -T
command line arg 15: c:\niping.log
NiSetParamEx: set NIP_SOCK_BUFFER_SIZE 32768
realloc origbuf from 0 to 131072 bytes
filling buffer with test data ...
NiHLGetNodeAddr: got hostname 'bubbas25' from operating system
NiIGetNodeAddr: hostname 'bubbas25' = addr 192.168.108.62
NiIGetServNo: servicename '1527' = port 1527
NiICreateHandle: hdl 1 state NI_INITIAL_CON
NiIInitSocket: set default settings for new hdl 1/sock 320 (I4; ST)
NiIBlockMode: set blockmode for hdl 1 FALSE
NiITraceByteOrder: CPU byte order: little endian, reverse network, low val .. high val
NiICheckPendConnection: connection of hdl 1 to 192.168.108.62:1527 established
NiIConnect: hdl 1 took local address 192.168.105.184:65352
NiIConnect: state of hdl 1 NI_CONNECTED
connect to server o.k.
NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
realloc netin buf from 0 to 131072 bytes
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 received data (rcd=131072,pac=6,MESG_IO)
NiWait: sleep (500ms) ...
Sat Nov 22 11:13:02 2014
NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 received data (rcd=131072,pac=21,MESG_IO)
NiWait: sleep (500ms) ...
NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 received data (rcd=131072,pac=19,MESG_IO)
Please advise?
Thanks,
Liam
You're right, the message errno=EAGAIN is not the sign of an error.
Even so, with all the possibly network-related errors you are getting, it would be good to have your network team check the connections.
Regards
500235 - Network Diagnosis with NIPING
NIPING should not abort with an error message under any circumstances. An error is always indicated by a line starting with "*** ERROR ...".
Entries like "NiIRead: hdl 0 recv would block (errno=EAGAIN)" do not indicate an error!
Hi,
This issue seems to be occurring because your Oracle processes are full.
To resolve this, check whether the Oracle internal maintenance jobs are active and disable them using SAP Note 974781 - Oracle internal maintenance jobs.
Also check SAP Note 1898521 - "Process list grows after ORA-01017; ORA-00020".
Hope this resolves your issue.
Regards
Bhupesh A
Hi, thanks for this.
I have deactivated the jobs as described in note 974781 in our dev environment, so I will see how this affects our issue.
Our kernel has just been upgraded, so it is already above the patch levels in the other note.
I will get back to you on whether it is successful or not.
Regards,
Liam