Currently we are running Version 7.20.8 2013-06-21 720_VAL_REL
About two weeks ago our system stopped provisioning users over a holiday break. This went unchecked, allowing the provisioning queue to build to about 3400 entries. I have been tasked with resolving the issue, which has our production environment at a standstill, with no real help coming from a High Severity SAP message over the last 5 days.
I performed the following steps before SAP was contacted.
Stopped and restarted the dispatchers and got the following errors in the dispatcher log:
exception: The Server Failed to resume the transaction. Desc:40000000001c
SQL Exception. The server failed to secure the transaction Desc.4000000001c
In the job/system log we would get a few notifications that tasks had completed, either successfully or with warnings. These came through just before or just after the dispatcher error.
Then nothing would provision, and the semaphore table would have a lock on it from the dispatcher that would just sit in a sleeping state and never release. We could stop and restart the dispatcher and get a couple of jobs to run, then end up with the same results.
In addition, the other dispatchers would now not run at all.
When the "select * from MXP_Provision" query is run from SQL Server Management Studio, our queue has 3400+ entries, with 17 in a failed status waiting on another job that failed.
Next, reassociated the correct .jar files with the dispatchers and regenerated the scripts. This had no effect; same issues as stated above.
Reinstalled the dispatcher services. This did not fix the issues above either, and now no further jobs were coming through. The queue was no longer shrinking; jobs were just sitting in failed status waiting for other jobs, or sitting in pending/sleeping/waiting status.
Contacted SAP. Confirmed that we are not running SQL in cluster mode, confirmed our two dispatchers (one for provisioning, one for housekeeping), and provided the SQL semaphore lock table data confirming that no other process is running except for the locked semaphore in a sleeping status (waiting on system information to release). We are presuming that this lock is causing other jobs to fail and the other dispatchers not to start.
SAP sent us a fix in the following note reply:
----------------------------------------------------------------------------------------------------------
SAP has a script that extracts important information from SQL Server called hangman that can be run in situations like this,
https://service.sap.com/sap/support/notes/948633
But I see Christopher Leonard and others have already requested the main information this gives, so more importantly, let's try to resolve the issue.
The problem I already mentioned that we suspect this to be is that a table or lock remains on the MC_SEMAPHORE table after commit is issued and that remains there until the session (dispatcher process) is terminated. We've created a replacement set of semaphore handling procedures that attempt to work around and solve this issue.
These have been attached, and can be installed in your environment.
The new version is in the ProcAndTableUpdates.sql file.
This is run as the _OPER account, and you need to do a search/replace of mxmc_ with <prefix>_ if your instance was not created using the default mxmc name.
You can revert back to the original SP8 procedures by installing the contents of "set procs original.sql" if there's any other issues. If you prefer to have us online with you during this process please let us know and we'll set up a conference call session with desktop sharing.
Best regards
Per Christian Krabsetsve
SAP Labs Norway
-------------------------------------------------------------------------------------
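For anyone else applying the attached script, the search/replace step from the reply could be scripted roughly like this. It is only a sketch: the function names, the sample SQL line, and the "idm" prefix are made up for illustration, and I am assuming the file can be read as plain UTF-8 text (adjust the encoding if your copy was saved differently). It is a blunt replacement, the same as a search/replace in an editor, so review the output before running it.

```python
import re
from pathlib import Path


def retarget_prefix(sql_text: str, new_prefix: str, old_prefix: str = "mxmc") -> str:
    """Replace every '<old_prefix>_' with '<new_prefix>_', ignoring case."""
    pattern = re.compile(re.escape(old_prefix) + "_", re.IGNORECASE)
    # A function is used as the replacement so that backslashes in
    # new_prefix are never interpreted as regex escapes.
    return pattern.sub(lambda _match: new_prefix + "_", sql_text)


def retarget_file(src: str, dst: str, new_prefix: str, encoding: str = "utf-8") -> None:
    """Write a copy of the script with the prefix swapped; src is left untouched."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(retarget_prefix(text, new_prefix), encoding=encoding)


# Hypothetical sample line, not taken from the real ProcAndTableUpdates.sql.
sample = "USE mxmc_db; EXEC MXMC_oper.example_proc"
converted = retarget_prefix(sample, "idm")
print(converted)
```

Writing to a separate output file keeps the original attachment intact in case the change needs to be rechecked.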
Applied the fix, and we are now getting the following error once in the dispatcher log, and no other job will run:
"interrupted due to invalid semiphore"
Received the following error in the system log for about 3 hours, and then no other errors after that:
mc_trans_commit_all implicit transaction set off sema:444:1 SPID:83
Using the audit trail in SQL, I have identified the failed provisioning tasks and have tried to run them directly, hoping to at least get a new error message that would tell me whether a config setting in the task or system is causing the failure. When I click the "run now" button, the last used time updates to the moment I press it, but no errors are logged anywhere (dispatcher, system, or job) and none of the jobs ever show a running status in the status view. It's as if the system never recognizes that any job or task was run.
I did get one final system log entry: "1 stale semaphores released". Since then nothing else will run (4 hours now).
Any help would be appreciated.
At this point the High Sev ticket has been with SAP Support for 5 days. My searches of SCN have not turned up anything that seems relevant to this release and patch level to help troubleshoot why our dispatchers consistently fail. They were picking up jobs before, and now they do not run anything at all, even with the provisioning queue showing 3400+ items in a pending status.