Saturday, 28 May 2011

Siebel database (OLTP) blocking


A few days back, I was working on a Siebel application that was suffering from frequent database blocking, and the running tasks often hit the max tasks limit.
We took the steps below to pinpoint the problem
Modus operandi:
  1. Configuring a mail to understand the general trend in blocking, which not only gives us a possible idea of the cause of blocking but also helps us formulate a strategy so that step 2 is effective and short in duration
  2. Enabling trace and increasing the log level for db and sql events (costly for the system, so it should be planned well using the output of step 1)
  3. Mapping the blocking pid to spid (To find the trace file)
  4. Using the pid and bind variable values to find the Siebel log
  5. Studying the Siebel log to find the problem
A very high level view of Siebel architecture is discussed in the following link:

      1.      Configuring a mail to understand the general trend in blocking, which not only gives us a possible idea of the cause of blocking but also helps us formulate a strategy so that step 2 is effective and short in duration

The following query gives information on the blocking and holding sessions.
We used the mailx utility on Solaris to send a mail every 5 minutes whenever blocking existed; the scheduling was done with crontab.
There are other possible ways to get a mail for this purpose; for example, we can build a subject area in OBIEE and configure iBots to send graphical reports on database health

SELECT ses_wait.blocking_session, waiter.SID waiting_session,
       waiter.seconds_in_wait,
       DECODE
          (ses_hold.state,
           'WAITED SHORT TIME', 'Holder did not wait and is currently '
            || ses_hold.status,
           'WAITING', 'Holder is also waiting for the past '
            || ses_hold.seconds_in_wait
            || ' seconds and is currently '
            || ses_hold.status,
           'WAITED UNKNOWN TIME', 'Duration of last wait of holder is unknown and is currently '
            || ses_hold.status,
           'WAITED KNOWN TIME', 'Duration of last wait by the holder is: '
            || ses_hold.wait_time
            || ' and is currently '
            || ses_hold.status
          ) "State of holder",
       sql_wait.sql_text, ses_wait.status "waiting STATUS",
       ses_wait.process "waiting process", ses_hold.process "holding process",
       ses_wait.program "waiting PROGRAM", ses_hold.program "holding PROGRAM",
       ses_wait.row_wait_obj# "waiting object",
       ses_wait.row_wait_row# "waiting row",
       ses_hold.row_wait_obj# "holding object",
       ses_hold.row_wait_row# "holding row",
       ses_wait.username "waiting username",
       ses_hold.username "holding username",
       ses_wait.sql_id "sql id of waiter"
  FROM v$session_wait waiter,
       v$session ses_hold,
       v$session ses_wait,
       v$sqlstats sql_wait
 WHERE waiter.SID = ses_wait.SID
   AND ses_wait.blocking_session = ses_hold.SID(+)
   AND ses_wait.sql_id = sql_wait.sql_id
   AND waiter.wait_time = 0
   AND waiter.seconds_in_wait > 60;
It is interesting to note that it is easier to get the session state of the waiting Oracle session (referred to as the waiter from now on), because that session is always active: it is waiting for the holder (the Oracle session which has executed a DML and whose transaction is not yet complete) to release the lock. Hence the query running in the waiter session can easily be captured from the v$sqlarea and v$sql views.
The blocker, on the other hand, acquires the lock, executes the DML and then becomes inactive (if it has no more SQL to execute). An inactive session has no current SQL, so the blocker's statement cannot be captured the same way, and since a holder cannot always be identified, the join to it is an outer join:
ses_wait.blocking_session = ses_hold.SID(+)
The only view which gives some information on the blocker's past statements is v$sql_bind_capture, but this view should be used with caution. A more detailed discussion of this topic is in the following post
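The 5-minute scheduling described in this step can be sketched as a crontab fragment. Everything below is an assumption for illustration: the credentials, paths and mail address are placeholders, and the blocking query above is assumed to be saved as /home/oracle/blocking.sql.

```shell
# Crontab fragment (sketch): run the blocking query every 5 minutes and
# mail any output to the DBA list. Credentials, script path and mail
# address are hypothetical placeholders.
*/5 * * * * out=$(sqlplus -s perfmon/secret @/home/oracle/blocking.sql) && [ -n "$out" ] && echo "$out" | mailx -s "DB blocking detected" dba-team@example.com
```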

After analyzing the blocking mails (configured in step 1 above) for a period of time, we found that our problem was TX enqueue contention in mode ‘X’. This exclusive lock is acquired to lock a row for manipulation.
The other, less common type of contention was TX enqueue contention in mode ‘S’. There are a number of possible reasons for this:
  1. A lock in S mode is acquired to put an entry into the ITL (Interested Transaction List) of the block which will be edited by the current statement. If we find this to be the problem, we can handle it by altering the INITRANS parameter of the object on which this lock is being taken. MAXTRANS is a related parameter which defines the maximum number of transactions that can have an entry in the ITL.
  2. Two sessions trying to insert the same value into a unique index.
  3. Two sessions updating the same bitmap index fragment. Each such fragment can cover a huge range of rows, which can cause serious contention; this is one reason bitmap indexes are rarely used in OLTP systems.
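If ITL waits (reason 1) are confirmed, the INITRANS fix mentioned above can be sketched as follows. The object names are illustrative, and note that a higher INITRANS applies only to newly formatted blocks, so the segment needs a rebuild for existing blocks to pick it up.

```shell
# Sketch: raise INITRANS on a hot table, assuming ITL contention was
# confirmed. siebel.s_contact and s_contact_p1 are illustrative names.
sqlplus -s "/ as sysdba" <<'EOF'
ALTER TABLE siebel.s_contact INITRANS 10;
-- INITRANS affects only blocks formatted after the change; move the
-- segment so existing blocks are reformatted with the new setting
ALTER TABLE siebel.s_contact MOVE;
-- a MOVE invalidates the table's indexes, so rebuild them afterwards
ALTER INDEX siebel.s_contact_p1 REBUILD;
EOF
```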

It is important to understand here that Oracle has many serialization mechanisms, which include enqueues (locks) and latches. The following post discusses this in detail
The following table lists the types of these contentions and the statements which lead to these locks

2.      Enabling trace at the database and increasing the log level for db and sql events, then waiting for a blocking mail

The Siebel event names can be found by using the following commands after connecting to the server manager
list evtloglvl %SQL% for server <server_host_name> comp eCommunicationsObjMgrClone_enu show ET_ALIAS, EC_LOGHNDL_LVL, ET_NAME
list evtloglvl %DBC% for server <server_host_name> comp eCommunicationsObjMgrClone_enu show ET_ALIAS, EC_LOGHNDL_LVL, ET_NAME
Increasing the log level for all events would make the logs very large and difficult to study.
The trace at the database should also be enabled. We are interested in the 10046 event at level 4 or level 12, so that we get the bind variable values. The bind values will help us find the Siebel log of the blocker.
The db trace can be enabled using the ‘alter system set events’ command.
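As a sketch, the 10046 trace can be switched on and off like this (level 12 adds both bind and wait information); it is costly, so it should run only for the short window planned in step 1:

```shell
# Sketch: enable event 10046 instance-wide at level 12 (binds + waits),
# then switch it off once a blocking session has been captured.
sqlplus -s "/ as sysdba" <<'EOF'
ALTER SYSTEM SET EVENTS '10046 trace name context forever, level 12';
EOF
# ...after a blocking mail arrives and the trace files are secured:
sqlplus -s "/ as sysdba" <<'EOF'
ALTER SYSTEM SET EVENTS '10046 trace name context off';
EOF
```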
The number of trace files and Siebel logs on a typical live system will be huge, even for a short duration. We have to come up with a strategy that minimizes the time for which the log levels are kept high while still catching a blocking session. The mails configured above can be used to establish such trends.

3.      Mapping the blocking pid to spid (To find the trace file)

The mail we configured gives the sid of the blocking Oracle session. This sid can be used to find the spid of the database server process executing the blocking DML.
The following query can help us get the spid:

      SELECT pro.spid
      FROM v$process pro, v$session ses
      WHERE pro.addr = ses.paddr
      AND ses.sid = :sid_from_mail;

This spid is part of the name of the trace file generated in the directory referenced by USER_DUMP_DEST on the database server.
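As a sketch, the trace file can be located from the spid like this; the SID and USER_DUMP_DEST path are illustrative, and `<lowercase sid>_ora_<spid>.trc` is the usual naming convention for foreground trace files:

```shell
# Sketch: derive the expected trace file path from the spid found above.
# ORACLE_SID and UDUMP are illustrative values for this example.
ORACLE_SID=ORCL
SPID=12345                                  # spid returned by the query
UDUMP=/u01/app/oracle/admin/ORCL/udump      # value of USER_DUMP_DEST
LOWER_SID=$(echo "$ORACLE_SID" | tr '[:upper:]' '[:lower:]')
echo "${UDUMP}/${LOWER_SID}_ora_${SPID}.trc"
```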
Once the trace file has been found using the spid, we have to find the blocking query.
 
We have to find the blocker's Siebel log by comparing the bind variable values in the blocker's db trace file with the Siebel logs, because we are unable to capture the thread id in a Solaris Siebel installation.
Since the Siebel application updates rows by their rowid and manipulates only a single row at a time, the holder query is generally the same as the waiter query.
The waiter query is available in the mail. We have to search for this query in the holder's trace file. If we find multiple runs of this query with different bind variable values, we should narrow them down by the time at which the blocking occurred; this information can also be taken from the mail.
We can cross-verify that we have picked the correct holding trace file by comparing the object manipulated in the query with the object for which the waiter is waiting. The latter can be found in the waiter's trace file, which is located with the same spid mechanism described above (this time for the waiter).
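The search through candidate trace files can be sketched with grep. The statement text, bind values and file names below are fabricated for illustration; the snippet creates its own stand-in trace file so it can be tried anywhere (the `-A` option assumes GNU grep):

```shell
# Sketch: find which trace files contain the waiter's statement, then
# show the bind values recorded after it. All content is illustrative.
TRACE_DIR=${TRACE_DIR:-/tmp/udump_demo}
mkdir -p "$TRACE_DIR"
# stand-in trace file mimicking a 10046 trace with bind values
cat > "$TRACE_DIR/orcl_ora_12345.trc" <<'EOF'
PARSING IN CURSOR #1
UPDATE S_CONTACT SET LAST_UPD = :1 WHERE ROW_ID = :2
BINDS #1:
 Bind#0  value="2011-05-28 10:15:02"
 Bind#1  value="1-ABC123"
EOF
# Which trace files contain the waiter's statement?
grep -l "UPDATE S_CONTACT" "$TRACE_DIR"/*.trc
# Pull the bind values that follow the statement (GNU grep -A)
grep -A 4 "UPDATE S_CONTACT" "$TRACE_DIR/orcl_ora_12345.trc" | grep "value="
```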

       4.      Using the pid and bind variable values to find the Siebel log

In Siebel installations on Windows, the process column in v$session captures the thread id of the client along with its process id, as <process_id>:<thread_id>, and the same information is available in the header of the ecommunication log. Hence finding the blocker's Siebel log is a lot easier on Windows.
      For example, 16416 is the process id and 490 is the thread id in the below screenshot of one such log

 
We first narrow down the number of logs with the following query:

select srvr_logfile_name
from siebel.s_srm_task_hist
where srvr_proc_id_val = :process_id_of_holder
and srvr_name = :Siebel_server_host_name
and created < to_date(:time_of_blocking, 'dd-mon-yyyy hh24:mi:ss');

We then have to search the shortlisted logs for the bind variable values taken from the holder's trace file. The Unix ‘grep’ command can be used for this search.
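As a sketch, once a bind value (here an illustrative ROW_ID) has been taken from the holder's trace file, grep shortlists the logs that mention it; the log name and value are fabricated so the snippet is self-contained:

```shell
# Sketch: search the shortlisted Siebel logs for a bind value taken from
# the holder's trace file. The log name and ROW_ID are illustrative.
LOG_DIR=${LOG_DIR:-/tmp/siebel_logs_demo}
mkdir -p "$LOG_DIR"
# stand-in Siebel object manager log containing the bind value
printf 'SqlParseAndExecute ... WHERE ROW_ID = "1-ABC123"\n' \
    > "$LOG_DIR/eCommunicationsObjMgr_enu_16416.log"
# list the logs that mention the bind value
grep -l "1-ABC123" "$LOG_DIR"/*.log
```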

5.    Studying the Siebel log to find the problem
Once we have identified the problematic ecommunication log, we can study it to find the root cause of the problem.
