Troubleshooting work queues

A work queue worker can encounter a problem that causes it to fail before it completes all the items that it checked out from the queue. For example, a server can die and kill its workers in the middle of processing. Such a failure can leave orphan work items. A work item becomes an orphan if a worker has the item checked out but does not finish processing it within an allotted amount of time. The progressinterval attribute on the <worker> element in work-queue.xml defines this time span.

Workers treat orphans just as they do available items. The next worker that encounters the orphan item in the table adopts it for processing and resets the LastUpdateTime, CheckedOutBy, and Status fields on the orphan work item.
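The orphan rule described above can be illustrated with a small model. This is only a sketch under stated assumptions, not PolicyCenter code: the WorkItem class, the status labels "available" and "checkedout", the function names, and the use of seconds as the time unit are all hypothetical. Only the field names LastUpdateTime, CheckedOutBy, and Status, and the idea of a progress interval, come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    """Hypothetical stand-in for a row in a work queue table."""
    item_id: int
    status: str = "available"            # assumed labels: available, checkedout, failed
    checked_out_by: Optional[str] = None
    last_update_time: float = 0.0        # assumed unit: seconds


def is_orphan(item: WorkItem, now: float, progress_interval: float) -> bool:
    """An item is an orphan if it is checked out and its last update
    is older than the progress interval."""
    return (
        item.status == "checkedout"
        and (now - item.last_update_time) > progress_interval
    )


def can_pick_up(item: WorkItem, now: float, progress_interval: float) -> bool:
    """Workers treat orphans just as they treat available items."""
    return item.status == "available" or is_orphan(item, now, progress_interval)


def adopt(item: WorkItem, worker_id: str, now: float) -> None:
    """Adopting an item resets its checked-out-by, status, and last-update fields."""
    item.checked_out_by = worker_id
    item.status = "checkedout"
    item.last_update_time = now


# Example: worker-A checked the item out at t=100 and then died.
item = WorkItem(item_id=1, status="checkedout",
                checked_out_by="worker-A", last_update_time=100.0)
if can_pick_up(item, now=200.0, progress_interval=60.0):
    adopt(item, "worker-B", now=200.0)
```

In this model, an item that was last updated 100 time units ago with a progress interval of 60 counts as an orphan, so worker-B adopts it. Adoption refreshes the last update time, which means that the item is no longer an orphan until the interval elapses again.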

If a work queue is experiencing a large number of orphans, review the log files to locate timeouts during processing. For example, a timeout can occur while a worker waits for an external server to return a value. If the log contains this type of timeout, increase the progressinterval value for the work queue in work-queue.xml to give workers more processing time.

Sometimes, a problem inherent in the item itself causes the processing of the item to fail. For example, a batch process throws an exception. In such cases, the worker stops processing the item and moves on to the next one. The item becomes an orphan, and the next worker that encounters it attempts to process it. In this way, a work queue attempts to process each item multiple times, up to a limit configured for the work queue. If a work item exceeds the limit of processing attempts, PolicyCenter changes the status of the work item to failed. Workers ignore items with a status of failed and no longer attempt to process them.
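The retry limit described above can be sketched as a simple loop. This is a simplified illustration under assumptions, not the PolicyCenter implementation: the function name run_with_retry_limit, the max_attempts parameter, and the "completed" label are hypothetical, and the sketch collapses the orphan-and-adopt cycle between attempts into a single loop. Only the failed status and the existence of a configured limit come from the text.

```python
from typing import Callable, Tuple


def run_with_retry_limit(process: Callable[[], None],
                         max_attempts: int) -> Tuple[str, int]:
    """Try to process one item up to max_attempts times.

    Returns a (status, attempts) pair. Each loop iteration stands in for
    one worker picking up the item, either fresh or as an orphan.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            process()
            return "completed", attempts   # hypothetical label for success
        except Exception:
            # The worker abandons the item; a later worker retries it.
            continue
    # The limit is exhausted: workers ignore the item from now on.
    return "failed", attempts


def bad_item() -> None:
    """An item whose processing always throws."""
    raise ValueError("problem inherent in the item")


status, attempts = run_with_retry_limit(bad_item, max_attempts=3)
```

With an item that always throws and a limit of three, the loop makes three attempts and then reports the item as failed. An item that fails only intermittently completes on a later attempt, as long as it succeeds before the limit is reached.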
