Background jobs getting stuck in Ready Status

liamclark1
Explorer
0 Kudos

Hi,

Certain background jobs are scheduled or triggered but then get stuck in Ready status and are not processed when their start time arrives.

When they get stuck in Ready status, SM37 always shows the application server as the executing server.

Our central server is a Unix system and our application servers run Windows; all of them use the Oracle database hosted on the central server.

The jobs that are getting stuck are both technical (RDDIMPDP, SAP_COLLECTOR_PERFMON etc) and user triggered jobs.

I've noticed that we sometimes receive messages stating 'unable to connect to oracle', which I assume are related, as they have become more frequent along with the stuck jobs.

The following messages keep appearing in the work process logs:

dblink[db_reconnect]: { new_reconnect_message=1
dbcon[db_con_reconnect]: { reco_trials=3, reco_sleep_time=5
00: name=R/3, con_id=000000000, state=INACTIVE    , tx=NO , bc=NO , hc=NO , perm=YES, reco=NO , info=NO ,
     timeout=000, con_max=255, con_opt=255, occ=NO , prog=
dbcon[db_con_reconnect]: } rc=0
***LOG BV4=> reconnect state is set for the work process [dblink       1999]
***LOG BYY=> work process left reconnect status [dblink       2000]
dblink[db_reconnect]: } rc=0
ThHdlReconnect: reconnect o.k.

Does anyone have an idea of where to look to resolve this?

We have looked at SAP Note 1902517 and tried various values for the parameters SQLNET.SEND_TIMEOUT and SQLNET.RECV_TIMEOUT, but they have not resolved this. Since we upgraded the kernel and support packs, the problem seems to have worsened in our dev environment too (where these parameters were changed).

The only workaround is to go into SM37 and start the job manually, but doing this for every stuck job takes too much time.

Let me know if you want more information, and thanks in advance.

Liam

Accepted Solutions (0)

Answers (4)


ken_halvorsen2
Active Participant
0 Kudos

Hi Liam

I experienced the same issue in our production system today, although we are using SQL Server. None of the batch jobs in Ready status could be changed, rescheduled or cancelled. They didn't even have a work process that I could kill on the server.

There were free background processes on all 4 app servers, so there was no work process contention.

What I did was remove all of the background work processes in RZ04 (CCMS: Maintain Work Process Distribution). After activating that operation mode, I was able to go to SM37, select the background job and run "Job -> Check Status", which allowed me to start the process and cancel it.

I then added the background processes back into the mode and was able to run background jobs on that server again.

For every batch job in that Ready state, there were event log entries stating "Database error: TemSe ... for table TST01 key" (or similar), a failure to read the status entry, and "Failed to enter message 00& (application area 55) in job log".

My theory is that there was a communication issue between this app server and the message server. Thankfully I didn't have to reboot the production servers during the business day, but we are planning to reboot at the weekend during the maintenance window.

Coincidentally, Microsoft server patches had been deployed the previous weekend.

Hope this helps

Ken

liamclark1
Explorer
0 Kudos

Hi Ken,

Can you clarify exactly what you did in RZ04? I'm not very familiar with how that transaction works, to be honest. When I go into RZ04 I only have the DUMMY mode there.

By 'cut all of the Background work processes', do you basically mean allocating 0 to them in that DUMMY mode?

Annoyingly (or ironically), we have not had any Ready job issues in production over the last week, but the problem is still appearing on our dev server. We are also working with SAP to try and resolve it.

I've found a note linked to the ORA-12577 issue we are experiencing, which basically says to upgrade the kernel to a certain patch level or above. Ours is a few levels below, so I'm going to upgrade dev tonight and see how it behaves after that.

We're also having issues with certain SAP screens that cannot be read on the dev app server, which also throws the ORA-12577 message and kicks users off. SAP has a program to repair dynpros after an upgrade, so I'm going to give that a go too.

Regards,

Liam

ken_halvorsen2
Active Participant
0 Kudos

Yes, basically I'm changing the operation mode to have 0 Background processes.

So here's how I did it. (FYI, we do have a couple of different Operation modes created, but it shouldn't matter)

RZ04 - double clicked any of the Operation Modes (i.e. Dummy)

Then, under the specific server, double-click the operation mode again and the "CCMS: Maintain Work Process Distribution" pop-up appears.

Click in the Background "Number of work processes" box, then click the " - " button underneath the Total until Background shows '0'.

Then click the Save button, then the Green arrow back and Save the Productive instance data.

Now go to transaction RZ03, Click on the Choose operation mode button, Select the operation mode (i.e. Dummy) and click on the Choose button.

Click on the server you want to change and from the Menu bar select Control -> Switch operation mode -> selected servers.

Click on the Yes button to Switch the selected server(s) to the Operation mode that you'd changed to have 0 background processes.

I found that now I was able to go to SM37, select the Jobs in the "Ready" state and follow the menu path Job -> Check Status.

On our system we have 4 app servers so we could get the jobs moved to other servers or stop them.

Don't forget that once you've cleared up the Ready status jobs, you will want to put the background work processes back into the operation mode that you'd taken them out of.

We've had it happen twice now and rebooted the servers after the second time. And they always say "Upgrade the kernel". That may or may not work, but it couldn't hurt.

I don't think it's an Oracle-specific problem, because we're running SQL Server 2005 on our systems. This is the first time that I've ever heard of something like this happening. We had applied some Microsoft patches at the weekend, so I'm figuring it was due to the MS patches and/or the sequence in which our cluster servers were restarted.

Your screen problem sounds like a kernel or OS problem??? But I don't know, probably not related.

Like I always say "If it's not one thing it's '2' "

Ken

former_member185239
Active Contributor
0 Kudos

Hi Liam,

When your jobs hang, run transaction SARFC and paste the output.

If it shows that few resources are available, then you need to tune the SAP parameters.

With Regards

Ashutosh Chaturvedi

liamclark1
Explorer
0 Kudos

Hi,

The resources are all OK, but thanks; I did not know of that transaction.

Regards,

Liam

former_member185239
Active Contributor
0 Kudos

Hi Liam,

Are your jobs still in Ready status? Does this happen at a particular time or every time?

With Regards

Ashutosh Chaturvedi

liamclark1
Explorer
0 Kudos

Hi Ashutosh,

From what I've seen with our Basis jobs, it doesn't seem to happen every time (for example, a job that recurs every 10 minutes doesn't get stuck every 10 minutes).

For user-triggered jobs I can't really say, as I'm not sure how often they are run. But I do not think it happens every time or at a particular time.

Regards,

Liam

former_member185239
Active Contributor
0 Kudos

Hi Liam,

How many dialog instances are there in your system? Is it occurring on a particular dialog instance?

With Regards

Ashutosh Chaturvedi

liamclark1
Explorer
0 Kudos

Hi,

Do you mean how many app servers, or how many dialog processes?

In dev and QA there is just one app server each.

In prod we have 3 app servers.

There are 10 dialog processes on each server in dev and QA, and 20 dialog processes on each server in prod.

Please let me know if you want different information.

Regards,

Liam

former_member185239
Active Contributor
0 Kudos

Hi Liam,

You are facing the issue on the production server, right?

On which app server is the job scheduled?

With Regards

Ashutosh Chaturvedi

Former Member
0 Kudos

When the jobs get stuck in Ready status, check the number of available and used background processes.

The reason could be that all background processes are occupied and no work process is available to execute the job.

Also check whether all application servers are used in load balancing, so that available background work processes on other app servers can be utilised.

ACE-SAP
Active Contributor
0 Kudos

Hi

Do you have errors or alerts in the DB system log?

You could check the maximum number of processes used on your DB and verify whether you are getting close to the limit defined by the instance parameter 'processes'.

select RESOURCE_NAME, INITIAL_ALLOCATION, MAX_UTILIZATION from v$resource_limit where RESOURCE_NAME in ('processes', 'sessions');

If so, you could increase the number of processes (and the related sessions parameter, sessions = 2 x processes).

alter system set "PROCESSES" = 150 scope = spfile sid='*';

alter system set "SESSIONS" = 300 scope = spfile sid='*';

I do not think that internal jobs could consume so many processes that none are left to serve the SAP work processes.

I'm not sure either that this error could come from an exhaustion of the Oracle processes.

You could check parameter rsdb/reco_add_error_codes to verify that some Oracle errors are not being trapped as DB disconnect problems (24806 - Database Reconnect: technical details and settings).

Regards


1431798 - Oracle 11.2.0: Database Parameter Settings

process = #ABAP work processes * 2 + #J2EE server processes * <max-connections> + PARALLEL_MAX_SERVERS + 40
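
As a rough illustration of the formula above, the arithmetic can be sketched in Python. The function and argument names are made up for this example, and the input values are hypothetical (they are not taken from any system in this thread). The sessions value simply applies the 2 x processes rule of thumb.

```python
def recommended_processes(abap_work_processes, j2ee_server_processes=0,
                          max_connections=0, parallel_max_servers=0):
    """PROCESSES according to the formula quoted from note 1431798."""
    return (abap_work_processes * 2
            + j2ee_server_processes * max_connections
            + parallel_max_servers
            + 40)


def recommended_sessions(processes):
    """SESSIONS = 2 x PROCESSES, the rule of thumb used in this thread."""
    return 2 * processes


# Hypothetical ABAP-only system: 30 work processes, PARALLEL_MAX_SERVERS = 10.
processes = recommended_processes(abap_work_processes=30, parallel_max_servers=10)
sessions = recommended_sessions(processes)
print(processes, sessions)  # 110 220
```

Count the work processes of all instances that connect to the database, not only those of the central instance.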

liamclark1
Explorer
0 Kudos

Hi Yves,

Thanks for the info. I am going to adjust both processes and sessions, as ours are set to 100 and 192, whereas according to the calculation they should be 149 and 298.

What exactly does the MAX_UTILIZATION value mean? Why is it different from INITIAL_ALLOCATION?

For example, processes currently has an initial allocation of 100 and a max utilization of 63.

Sessions has an initial allocation of 192 and a max utilization of 73.

Regards,

Liam

ACE-SAP
Active Contributor
0 Kudos

Hi Liam,

Max utilization is the highest value reached for that resource since the instance was last started, meaning the maximum number of processes in use at any one time.
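
To make the comparison concrete, here is a small sketch that turns the two columns into a peak-usage ratio. It uses the figures posted above in this thread (processes 100 / 63, sessions 192 / 73). The function name and the 90% warning threshold are assumptions chosen for illustration, and the values are assumed to be already converted to integers.

```python
def utilization_report(rows, warn_ratio=0.9):
    """rows: (resource_name, initial_allocation, max_utilization) tuples,
    e.g. the result of the v$resource_limit query earlier in the thread."""
    report = {}
    for name, initial, max_used in rows:
        ratio = max_used / initial
        # Flag the resource when its peak usage came close to the limit.
        report[name] = (round(ratio, 2), ratio >= warn_ratio)
    return report


# Figures posted in this thread: processes 100 / 63, sessions 192 / 73.
rows = [("processes", 100, 63), ("sessions", 192, 73)]
for name, (ratio, near_limit) in utilization_report(rows).items():
    print(f"{name}: peak {ratio:.0%} of limit, near limit: {near_limit}")
```

With these numbers neither resource came near its limit, which suggests the processes parameter is probably not the bottleneck here.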

But maybe we are not searching in the right direction...

So this happens on a system with several instances, and the time-driven jobs only manage to start on the central DB instance?

If they are supposed to run on one of the Windows AS instances, they get stuck.

Have a look at note 544881 - Composite SAP note: Time-driven jobs do not run

Regards

liamclark1
Explorer
0 Kudos

Hi Yves,

I take it that this value is controlled by the system itself, then, and cannot be modified by us?

Yes, that's right, the jobs are only getting stuck on the Windows app server.

I will have a read of the note, but I'm not sure it applies, as it mentions

"Jobs with an "Immediate" or "After event" start condition run without any problems."

whereas on our system, jobs with an "After event" start condition (such as RDDIMPDP) are getting stuck.

Regards,

Liam

ACE-SAP
Active Contributor
0 Kudos

Hi

No, you can change the parameter; the view only tells you the maximum number of processes that have been in use. It lets you check whether the processes parameter is set too low. In your case it seems to be OK, as the maximum used did not reach the number of processes you defined.

Make sure that parameter rsts/enqueue/enabled is not set (1724201 - Background jobs remain in the status "ready").

Are the failing jobs defined to run on a specific server or server group?

Have a look at note 1057255 - Jobs remain in status 'ready', but I do not think it will help...

And at note 394677 - Jobs do not run on certain servers

The time scheduler runs periodically at short intervals on every background server. The event scheduler can start jobs only on the server on which it is running.

There is no point in several time schedulers running on the same server simultaneously. To avoid this situation, a server-specific semaphore is used.

From a technical perspective, for Basis releases lower than Release 7.10, this semaphore is an enqueue lock of the form

   Table = BTCREMTCLN    Argument = <SERVER NAME>

If you cannot delete this lock after the time scheduler has run due to an external error, the time scheduler no longer runs on this server, and almost no jobs can be started on this server as a result.

Also perform the additional tests (Goto -> Additional tests) in SM65 for all your servers.

Regards

liamclark1
Explorer
0 Kudos

Hi,

I deactivated the autotask jobs and set rdisp/btctime to 30 yesterday. This seems to have helped somewhat, as there are only 2 jobs in Ready status now, and both are job 'SAP_COLLECTOR_FOR_NONE_R3_STAT' (program RSN3_STAT_COLLECTOR).

This job has entries in SM21 saying:

04:20:35 BTC  011 000 SAPSYS                      BY  M SQL error 3135 occurred; work process in reconnect status

04:20:35 BTC  011 000 SAPSYS                      F6  H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1

04:20:35 BTC  011 000 SAPSYS                      F6  H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1

04:20:35 BTC  011 000 SAPSYS                      F6  H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1

04:20:35 BTC  011 000 SAPSYS                      F6  H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1

04:20:35 BTC  011 000 SAPSYS                      EC  F Failed to create log for job SAP_COLLECTOR_FOR_NONE_R3_STAT 03203500/

04:20:35 BTC  011 000 SAPSYS                      EA  Y Failed to read status entry for job SAP_COLLECTOR_FOR_NONE_R3_STAT

04:20:35 BTC  011 000 SAPSYS                      EB  C > Job SAP_COLLECTOR_FOR_NONE_R3_STAT

04:20:35 BTC  011 000 SAPSYS                      EC  J Failed to enter message 00& (application area 55) in job log

04:20:35 BTC  011 000 SAPSYS                      EA  Y Failed to read status entry for job SAP_COLLECTOR_FOR_NONE_R3_STAT

04:20:35 BTC  011 000 SAPSYS                      EB  C > Job SAP_COLLECTOR_FOR_NONE_R3_STAT

04:20:35 BTC  011 000 SAPSYS                      F2  0 Calling program reports invalid handle for TemSe object (magic==X'NULL-ptr')

04:20:35 BTC  011                                 BY  M SQL error 3114 occurred; work process in reconnect status

04:20:35 BTC  011                                 BY  Y Work process has left reconnect status
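
To see which database errors dominate when there are many such entries, the lines can be tallied with a short script. This is only a sketch: the regular expression assumes the exact wording "SQL error NNNN occurred" shown above, the function name is made up, and the descriptions are the standard Oracle message texts for ORA-03135 and ORA-03114.

```python
import re
from collections import Counter

SQL_ERROR = re.compile(r"SQL error (\d+) occurred")

# Standard Oracle message texts for the two codes seen above.
KNOWN = {
    3135: "ORA-03135: connection lost contact",
    3114: "ORA-03114: not connected to ORACLE",
}


def tally_sql_errors(lines):
    """Count SQL error codes in SM21-style syslog lines."""
    counts = Counter()
    for line in lines:
        match = SQL_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = [
    "04:20:35 BTC  011 000 SAPSYS  BY  M SQL error 3135 occurred; work process in reconnect status",
    "04:20:35 BTC  011 000 SAPSYS  F6  H Database error: TemSe->XRTAB(0)->1 for table TST01 key [200]JOBLGX03203500X45925,1",
    "04:20:35 BTC  011             BY  M SQL error 3114 occurred; work process in reconnect status",
]
for code, count in sorted(tally_sql_errors(sample).items()):
    print(count, KNOWN.get(code, f"ORA-{code:05d}"))
```

Both codes describe a database connection that was lost, which fits the reconnect messages in the work process trace.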

There are also a number of entries in SM21 from this morning stating:

00:05:48 DP                                       Q0  N Failed to send a request to the message server

00:05:49 DIA  009 000 SAPSYS                      GI  0 Error calling the central lock handler

00:05:49 DIA  009 000 SAPSYS                      GI  2 > Unable to reach central lock handler

These appeared after the system started again following a backup.

Any ideas what has caused these?

Regards,

Liam

ACE-SAP
Active Contributor
0 Kudos

Hi

You should check whether there are network problems between the PAS and the AS (the central Unix box and the Windows servers).

You can try a continuous ping (ping -t) or, even better, use SAP's niping.

A network issue could explain the other errors you get:

=> NFS/SMB problem, as some jobs are not able to write their log ( => 1 for table TST01 key [200]JOBLGX03203500X45925,); the job log should be on the globalhost share.

=> The AS instances are not able to reach the enqueue process.
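
If niping is not at hand, a crude stand-in for the connectivity check above is to open a TCP connection to the database listener repeatedly and record how long each attempt takes. The sketch below is not a replacement for niping: it only exercises connection setup, not sustained transfers over an established connection. The function names are made up, and the host name and port in the usage comment are placeholders.

```python
import socket
import time


def probe(host, port, attempts=5, delay=0.5, timeout=3.0,
          connect=socket.create_connection):
    """Open and close a TCP connection `attempts` times.

    Returns one entry per attempt: the connect time in seconds,
    or None if the attempt failed.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            conn = connect((host, port), timeout)
            conn.close()
            results.append(time.monotonic() - start)
        except OSError:
            # Covers refused connections, timeouts and name lookup failures.
            results.append(None)
        if i < attempts - 1:
            time.sleep(delay)
    return results


def summarize(results):
    """Condense probe results into attempt, failure and worst-case figures."""
    ok = [r for r in results if r is not None]
    return {"attempts": len(results),
            "failed": len(results) - len(ok),
            "slowest": max(ok) if ok else None}


# Usage (placeholder host and port, run from the Windows app server):
# print(summarize(probe("central-host", 1527, attempts=20)))
```

Any failed attempt, or a slowest value far above the others, would be something to hand to the network team.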

Regards

liamclark1
Explorer
0 Kudos

Hi,
We now seem to be getting different errors in SM21 when someone runs a transaction:

Database error 12577 with GET access to table REPOLOAD
> ORA-12577: Message 12577 not found; product=RDBMS;
> facility=ORA#
> Include ??? line 0000.
Runtime error "DBIF_REPO_SQL_ERROR" occurred.

Here are also the niping results from the app server log:

---------------------------------------------------
trc file: "niping.log", trc level: 2, release: "721"
---------------------------------------------------


Sat Nov 22 11:13:01 2014
NiIInit: allocated nitab (4096 at 0000000004310080)
NiIHSBufInit: initialize hostname buffer (IPv4)
NiHLInit: alloc host buf (100 entries)
NiSrvLInit: alloc serv bufs (100 entries)

command line arg 0: niping
command line arg 1: -c
command line arg 2: -H
command line arg 3: bubbas25
command line arg 4: -S
command line arg 5: 1527
command line arg 6: -B
command line arg 7: 131072
command line arg 8: -L
command line arg 9: 20
command line arg 10: -D
command line arg 11: 500
command line arg 12: -V
command line arg 13: 2
command line arg 14: -T
command line arg 15: c:\niping.log

NiSetParamEx: set NIP_SOCK_BUFFER_SIZE 32768
realloc origbuf from 0 to 131072 bytes
filling buffer with test data ...
NiHLGetNodeAddr: got hostname 'bubbas25' from operating system
NiIGetNodeAddr: hostname 'bubbas25' = addr 192.168.108.62
NiIGetServNo: servicename '1527' = port 1527
NiICreateHandle: hdl 1 state NI_INITIAL_CON
NiIInitSocket: set default settings for new hdl 1/sock 320 (I4; ST)
NiIBlockMode: set blockmode for hdl 1 FALSE
NiITraceByteOrder: CPU byte order: little endian, reverse network, low val .. high val
NiICheckPendConnection: connection of hdl 1 to 192.168.108.62:1527 established
NiIConnect: hdl 1 took local address 192.168.105.184:65352
NiIConnect: state of hdl 1 NI_CONNECTED
connect to server o.k.
NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
realloc netin buf from 0 to 131072 bytes
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 received data (rcd=131072,pac=6,MESG_IO)

NiWait: sleep (500ms) ...
Sat Nov 22 11:13:02 2014
NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 received data (rcd=131072,pac=21,MESG_IO)

NiWait: sleep (500ms) ...
NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 recv would block (errno=EAGAIN)
NiIRead: hdl 1 received data (rcd=131072,pac=19,MESG_IO)

Please advise?

Thanks,

Liam

ACE-SAP
Active Contributor
0 Kudos

Hi

It is high time to involve the network team!

The network link between the central Unix PAS and the AS has a real problem.

There is nothing you can do at the SAP level.

Regards

liamclark1
Explorer
0 Kudos

Hi, can I ask why you say that? Is it based on the SM21 logs or the niping result?

It's just that I have read that results like "NiIRead: hdl 1 recv would block (errno=EAGAIN)" do not indicate an error.

In this link:

Please advise?

Regards,

Liam

ACE-SAP
Active Contributor
0 Kudos

You're right, the message errno=EAGAIN is not the sign of an error.

Even so, with all the possibly network-related errors you are getting, it would be good to have your network team check the connections.

Regards

500235 - Network Diagnosis with NIPING

NIPING should not abort with an error message under any circumstances. An error is always indicated by a line starting with "*** ERROR ...".

Entries like "NiIRead: hdl 0 recv would block (errno=EAGAIN)" do not indicate an error!
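
Going by the rule quoted from the note, a niping trace can be checked mechanically: only lines starting with "*** ERROR" count, and the EAGAIN lines are ignored. A minimal sketch, with a made-up function name and a sample taken from the trace posted earlier:

```python
def scan_niping_trace(lines):
    """Return (error_lines, eagain_count) for an iterable of trace lines."""
    errors, eagain = [], 0
    for line in lines:
        if line.lstrip().startswith("*** ERROR"):
            errors.append(line.rstrip())
        elif "errno=EAGAIN" in line:
            # Non-blocking read had no data yet; harmless according to the note.
            eagain += 1
    return errors, eagain


sample = [
    "NiIWrite: hdl 1 sent data (wrt=131072,pac=1,MESG_IO)",
    "NiIRead: hdl 1 recv would block (errno=EAGAIN)",
    "NiIRead: hdl 1 recv would block (errno=EAGAIN)",
    "NiIRead: hdl 1 received data (rcd=131072,pac=6,MESG_IO)",
]
errors, eagain = scan_niping_trace(sample)
print(f"{len(errors)} error line(s), {eagain} EAGAIN retries")
```

The function reads its input in a single pass, so an open file object for the trace file can be passed in directly.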

Former Member
0 Kudos

Hi,

This issue seems to occur because your Oracle processes are exhausted.

To resolve it, check whether the Oracle internal maintenance jobs are active and, if so, disable them as described in SAP note #974781 - Oracle internal maintenance jobs.

Also check SAP note #1898521 - "Process list grows after ORA-01017; ORA-00020".

Hope this resolves your issue.

Regards

Bhupesh A

liamclark1
Explorer
0 Kudos

Hi, Thanks for this,

I have deactivated the jobs in our dev environment as described in note 974781, so we will see how this affects our issue.

Our kernel has just been upgraded, so it is already above the patch levels in the other note.

I will get back to you on whether it is successful or not.

Regards,

Liam