NFS file locking and it’s affect on Connections

A customer moved the shared storage (NAS) that Connections uses to new hardware. Check out by the testers showed that Connections was functioning correctly but by Monday morning it was apparent there were problems.

Looking at the various SystemOut.logs on monday morning I saw errors such as the following on both the Connections nodes.

[16/06/12 18:25:33:585 BST] 0000004c LocalExceptio E   CNTR0020E: EJB threw an unexpected (non-dered) exception during invocation of method “publishEvent” on bean “BeanId(Files#lc.events.publisher#EventPublisher, null)”. Exception data: com.ibm.lconn.events.internal.invoke.async.AsyncEventInerException: CLFWY0148E: Could not obtain an event connection
at com.ibm.lconn.events.internal.publish.impl.JMSPublisher.publishEvent(JMSPublisher.java9)

Caused by: javax.jms.JMSException: CWSIA0241E: An exception was received during the call to the method JmsManagedConnectionFactoryImpl.createConnection: com.ibm.websphere.sib.exception.SIResourceException: CWSIT0088E: There are currently no messaging engines in bus ConnectionsBus running. Additional failure information: CWSIT0103E: No messaging engine was found that matched the following parameters: bus=ConnectionsBus, targetGroup=null, targetType=BusMember, targetSignificance=Preferred, transpo
rtChain=InboundSecureMessaging, proximity=Bus..
at com.ibm.ws.sib.api.jms.impl.JmsManagedConnectionFactoryImpl.createConnection(JmsManagedConnectionFactoryImpl.java:195)
Caused by: com.ibm.websphere.sib.exception.SIResourceException: CWSIT0088E: There are currently no messaging engines in bus ConnectionsBus running. Additional failure information: CWSIT0103E: No messaging engine was found that matched the following parameters: bus=ConnectionsBus, targetGroup=null, targetType=BusMember, targetSignificance=Preferred, transportChain=InboundSecureMessaging, proximity=Bus.
at com.ibm.ws.sib.trm.client.TrmSICoreConnectionFactoryImpl.localBootstrap(TrmSICoreConnectionFactoryImpl.java:363)

Looking at the SystemOut.log after start up I found that the SibMessages stopped whilst trying to initialise the Log file. The desired output is further down.

Current:
[16/06/12 17:06:21:001 BST] 00000015 SibMessage    I   [ConnectionsBus:dogearcluster.000-ConnectionsBus] CWSIS1581I: The file store is attempting to initalise its log file: /opt/IBM/LotusConnections/data/shared/messageStores/dogearcluster/log/Log

Desired output:
[16/06/12 08:07:48:299 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSID0016I: Messaging engine filescluster.000-ConnectionsBus is in state Starting.
[16/06/12 08:07:48:435 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1581I: The file store is attempting to initalise its log file: /opt/IBM/LotusConnections/data/shared/messageStores/filescluster/log/Log
[16/06/12 08:07:48:552 BST] 000000a1 ObjectManager I   CWSOM0002I: ObjectManager using logFile /opt/IBM/LotusConnections/data/shared/messageStores/filescluster/log/Log was warm started logFileType=FILE.
[16/06/12 08:07:48:676 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1584I: The file store has initialised successfully.
[16/06/12 08:07:48:676 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1585I: The file store is attempting to start its permanent store (PermanentStore) and temporary store (TemporaryStore).
[16/06/12 08:07:48:677 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1586I: The file stores permanent store has been started successfully.
[16/06/12 08:07:48:677 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1587I: The file stores temporary store has been started successfully.
[16/06/12 08:07:48:678 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1559I: The data in the permanent store file occupies 1,434,580 bytes.
[16/06/12 08:07:50:362 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSID0016I: Messaging engine filescluster.000-ConnectionsBus is in state Started.

That evening the contents of the /messageStores/* were deleted as detailed in High number of publication points observed for Connections Files – no notifications. Profiles was started first to check that the new Log file was created successfully but the following errors were seen.

[18/06/12 23:15:02:094 BST] 00000012 SystemOut O at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.utils.Utils.getImpl(Utils.java:44)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.utils.FileLock.getFileLock(FileLock.java:41)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManagerState.<init>(ObjectManagerState.java:408)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManager.createObjectManagerState(ObjectManager.java:293)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManager.initialise(ObjectManager.java:237)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManager.<init>(ObjectManager.java:197)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.msgstore.persistence.objectManager.PersistentMessageStoreImpl.start(PersistentMessageStoreImpl.java:354)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.msgstore.impl.MessageStoreImpl.start(MessageStoreImpl.java:1518)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.admin.impl.JsMessagingEngineImpl.start(JsMessagingEngineImpl.java:609)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.admin.impl.HAManagerMessagingEngineImpl.activate(HAManagerMessagingEngineImpl.java:995)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.admin.impl.JsActivationThread.run(JsActivationThread.java:92)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O Caused by: java.io.IOException: No locks available
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:914)
[18/06/12 23:15:02:096 BST] 00000012 SystemOut O at java.nio.channels.FileChannel.tryLock(FileChannel.java:973)
[18/06/12 23:15:02:096 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.utils.FileLockImpl.<init>(FileLockImpl.java:58)

We then tried starting the other node which did start correctly, the Log files were created and the SystemOut.logs looked good. We then started the affected node and everything checked out.

The following day myself and a colleague asked for some data from the Linux servers and noticed in dmesg the following errors.

statd: server localhost not responding, timed out
lockd: cannot monitor 111.125.65.7
lockd: failed to monitor 111.125.65.7
statd: server localhost not responding, timed out
lockd: cannot monitor 111.125.65.7
lockd: failed to monitor 111.125.65.7
statd: server localhost not responding, timed out
lockd: cannot monitor 111.125.65.7
lockd: failed to monitor 111.125.65.7

The services mentioned are related to NFS and file locking which fits with what we are seeing and the change the customer made over the weekend.

The customer’s Unix team found a match on Redhat’s website which describes the condition and they wanted to remount and restart the nfslock service. Unix needed all applications stopped on the affected node that access the share and they ran the following commands.

/etc/init.d $ umount /net/connections
/etc/init.d $ ./nfslock stop
/etc/init.d $ ./nfslock start
/etc/init.d $ mount /net/connections

I started all the application servers on the affected node and then to test whether file locking was working I stopped two of the applications on the other node so that the affected node took over and thankfully the affected node showed healthy messages in the SystemOut.logs.

[19/06/12 15:48:01:243 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1581I: The file store is attempting to initalise its log file: /opt/IBM/LotusConnections/data/shared/messageStores/forumcluster/log/Log
[19/06/12 15:48:01:406 BST] 00000047 ObjectManager I CWSOM0002I: ObjectManager using logFile /opt/IBM/LotusConnections/data/shared/messageStores/forumcluster/log/Log was warm started logFileType=FILE.
[19/06/12 15:48:01:524 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1584I: The file store has initialised successfully.
[19/06/12 15:48:01:524 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1585I: The file store is attempting to start its permanent store (PermanentStore) and temporary store (TemporaryStore).
[19/06/12 15:48:01:525 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1586I: The file stores permanent store has been started successfully.
[19/06/12 15:48:01:525 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1587I: The file stores temporary store has been started successfully.
[19/06/12 15:48:01:526 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1559I: The data in the permanent store file occupies 1,255,938 bytes.
[19/06/12 15:48:01:526 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1551I: The current minimum reserved size of the permanent store file is 209,715,200 bytes. The current maximum size is 524,288,000 bytes.
[19/06/12 15:48:01:527 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1560I: The data in the temporary store file occupies 8,197 bytes.
[19/06/12 15:48:01:527 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1555I: The current minimum reserved size of the temporary store file is 209,715,200 bytes. The current maximum size is 524,288,000 bytes.
[19/06/12 15:48:01:528 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1550I: The current size of the log file is 104,857,600 bytes. The size in the configuration information of the log file is 104,857,600 bytes.
[19/06/12 15:48:01:532 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1564I: The messaging engine, ME_UUID=392DF5A35DD2FFC2, INC_UUID=71C871C80536E9D9, is attempting to obtain an exclusive lock on the file store.
[19/06/12 15:48:01:533 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1566I: A single previous owner was found in the messaging engine’s file store, ME_UUID=392DF5A35DD2FFC2, INC_UUID=1FE31FE301D82BD2
[19/06/12 15:48:01:542 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1563I: The messaging engine, ME_UUID=392DF5A35DD2FFC2, INC_UUID=71C871C80536E9D9, has acquired an exclusive lock on the file store.
[19/06/12 15:48:01:573 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1588I: The file store has started successfully.
[19/06/12 15:48:02:356 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0212I: Messaging engine forumcluster.000-ConnectionsBus on bus ConnectionsBus is starting to reconcile the WCCM destination and link configuration.
[19/06/12 15:48:02:375 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0213I: Messaging engine forumcluster.000-ConnectionsBus on bus ConnectionsBus has finished reconciling the WCCM destination and link configuration.
[19/06/12 15:48:02:488 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSJZ0021I: WS-ReliableMessaging is enabled for messaging engine forumcluster.000-ConnectionsBus on bus ConnectionsBus.
[19/06/12 15:48:02:542 BST] 00000047 SibMessage W [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0028W: Support for multiple subscribers to subscription commands##forums, on destination connections.platformCommands, has been disabled.
[19/06/12 15:48:02:549 BST] 00000047 SibMessage W [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIV1101W: The MDB maximum concurrency configuration value has been changed from the value 1 to 1.
[19/06/12 15:48:02:554 BST] 00000047 SibMessage W [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0181W: The maximum batch size for the consumer on destination connections.platformCommands, has been set to 1.
[19/06/12 15:48:02:556 BST] 00000047 SibTr$Suppres I CWSIU0006I: Future occurences of “CWSIP0181W: The maximum batch size for the consumer on destination connections.platformCommands, has been set to 1.” will be suppressed.
[19/06/12 15:48:02:557 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIV0777I: A connection to messaging engine forumcluster.000-ConnectionsBus for destination connections.platformCommands on bus ConnectionsBus has been successfully created.
[19/06/12 15:48:02:558 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSID0016I: Messaging engine forumcluster.000-ConnectionsBus is in state Started.

I found the following statement in the Connections wiki, the customer is using NFS v3.

“In a multi-node cluster, you need to configure network share directories for content stores and search indexes. When using NFS, use NFS v4 because NFS v3 lacks advanced locking capability.”

Dave Hay came up against a similar problem in his post “Lotus Connections 2.5.0.1 – Problems with Service Integration Bus after configuring Shared Message Store” which is worth a read too.

High number of publication points observed for Connections Files – no notifications

I raised a PMR with IBM after noticing (whilst investigating another problem) the publication points were at it’s upper most value, 50000.

I also saw the following in Files SystemOut.log

“javax.jms.JMSException: CWSIA0067E: An exception was received during the call to the method JmsMsgProducerImpl.sendMessage (#4): com.ibm.ws.sib.processor.exceptions.SIMPLimitExceededException: CWSIP0251E: The destination connections.events on messaging engine filescluster.000-ConnectionsBus is not available because the high limit 50,000 for the number of messages for this destination has already been reached..”

Having not had any issues reported by the customer I raised a PMR to find out why this had happened and the effect that it may have had. Quite quickly after some testing I found that no notifications were being sent from Files.

What triggered this blog is the publication of a Technote  (created from my PMR) detailing the reason and the permanent fix. One thing the Technote does not cover is how to clear the queues which will DELETE all of the pending messages.

Delete the SIB file stores – just delete everything under the “messageStores” directory. Location can be found under Buses > ConnectionsBus > Messaging engines > <clusters> > File store, ie:

\opt\IBM\LotusConnections\data\shared\messageStores\*

– Delete the WebSphere transaction logs under

<WAS_HOME>\profiles\<AppServers>\tranlog\<cell>\<node>\

– Delete everything under that directory.

– Start the nodes.