IBM Connections Mail not working due to Domino view oddness

I’m sure I could have come up with a better title but I’m not sure how else to put it.

Prior to going live with an internal Connections 5.5 deployment my colleagues in India were testing Connections and they kept getting the following error appear on each page in Connections.

"You are no longer logged in. Click OK to discard your current work and go to the log in screen...."

1

Having seen this in customer environments in the past I knew it was due to IBM Connections  Mail but I didn’t know why.

I had the user open up (in a new tab in the same browser) the URL for iNotes and he got the following error.

"CN=****** you have insufficient rights for /mail/***.nsf. Please login with a username and password which has sufficient rights."

2

SSO has been set up correctly and the configuration is the same for everyone. Those in the UK work fine.

I compared the DistinguishedName in AD (as Connections uses AD for it’s LDAP) and the OU my colleagues in India use differs to those in the UK. I noticed that there was a double space between the words in one of the India OUs. That was the only difference between the two sets of users.

I checked the value in the user’s person document, Administration tab and LTPA user name field and it showed correctly ie it had the double spaces in it.

My colleague looked at all the users connect to the iNotes server. For me it showed my Domino format name ie Ben Williams/Something/Org but for the problematic user and his colleagues it showed his AD name still. So name resolution wasn’t working.

We scratched our heads and then I remembered an old problem for a customer (not related) and had my colleague open the address book and we looked in the $USERS view. In there we saw the user but the DN did not have the double space but a single space. That would explain why the AD DN didn’t resolve to the Domino hierarchical name.

When my colleague attempted to paste the AD DN into the user name field of his person document and save the change we saw that the text “moved” removing the additional space! I Googled, looked at the old Domino Technote database and the APAR support website but I couldn’t find anything to describe why this would happen.

In the end I spoke with our AD guys and they updated the OU removing the extra space. Then we updated the LTPA user name field (just to keep things clean) and our brethren in our India office could use IBM Connections Mail.

HOMEPAGE.SR_RESUME_TOKENS duplicate data in IBM Connections – proper fix

I wrote a post, HOMEPAGE.SR_RESUME_TOKENS duplicate data in IBM Connections, where I work around the problem by clearing the contents of SR_RESUME_TOKENS. I found that every restart of the JVM hosting Search caused more rows to be added to the table. I raised a PMR and IBM came back and told me that others have raised the same problem and it is due to the fact that constraints are missing. The missing constraints should have been added during the “post” migration process to reapply the constraints after using dbt.jar.

My constraints looked like this:

constraints2

Whilst they should have looked like this:

constraints1

I stopped the JVM hosting Search and ran the following DB2 queries

db2 “DELETE FROM HOMEPAGE.SR_RESUME_TOKENS WHERE NODE_ID = ‘xxxxxNode01:InfraCluster_server1′”
db2 “ALTER TABLE HOMEPAGE.SR_RESUME_TOKENS ADD CONSTRAINT “PK_TOKEN_ID” PRIMARY KEY (“TOKEN_ID”)”
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
db2 “ALTER TABLE HOMEPAGE.SR_RESUME_TOKENS ADD CONSTRAINT “FK_RT_IDX_MGMT_ID” FOREIGN KEY (“NODE_ID”) REFERENCES HOMEPAGE.SR_INDEX_MANAGEMENT(“NODE_ID”) ON DELETE CASCADE”
DB20000I  The SQL command completed successfully.
db2 “RUNSTATS ON TABLE HOMEPAGE.SR_RESUME_TOKENS”
DB20000I  The RUNSTATS command completed successfully.
db2 “RUNSTATS ON TABLE HOMEPAGE.SR_RESUME_TOKENS FOR INDEXES ALL”
DB20000I  The RUNSTATS command completed successfully.

On restarting the Search JVM a number of times I found that only one row was created for each application and not multiple as I found previously.

Thanks IBM 🙂

HOMEPAGE.SR_RESUME_TOKENS duplicate data in IBM Connections

I was checking things after migrating IBM Connections from version 4.0 to 5.5 and found the following error in the application server hosting Search. It didn’t stop the search index and returning results.

[11/18/16 18:46:00:604 GMT] 000001ba XmlBeanDefini I org.springframework.beans.factory.xml.XmlBeanDefinitionReader loadBeanDefinitions Loading XML bean definitions from class path resource [org/springframework/jdbc/support/sql-error-codes.xml]
[11/18/16 18:46:00:627 GMT] 000001ba SQLErrorCodes I org.springframework.jdbc.support.SQLErrorCodesFactory <init> SQLErrorCodes loaded: [DB2, Derby, H2, HSQL, Informix, MS-SQL, MySQL, Oracle, PostgreSQL, Sybase]
[11/18/16 18:46:00:645 GMT] 000001ba IndexingTaskB W com.ibm.connections.search.ejbs.indexing.IndexingTaskBean processTask CLFRW0395E: An error occurred while running the scheduled indexing task named 15min-search-indexing-task.
                                 com.ibm.connections.search.admin.index.exception.IndexingTaskException: org.springframework.jdbc.UncategorizedSQLException: SqlMapClient operation; uncategorized SQLException for SQL []; SQL state [null]; error code [0]; Error: executeQueryForObject returned too many results.; nested exception is java.sql.SQLException: Error: executeQueryForObject returned too many results.

I Googled “returned too many results” and it hinted at duplicate data in databases for different IBM products. Hmmm.

I enabled the following trace and ran a one of indexing task, SearchService.indexNow(“all_configured”)

com.ibm.connections.search.index.indexing.*=all: com.ibm.connections.search.seedlist.*=all: com.ibm.connections.httpClient.*=all

In trace.log I saw more information and just prior to the database exception I saw resume token messages

[11/18/16 18:46:00:580 GMT] 000001ba ResumeTokenIn > com.ibm.connections.search.seedlist.crawler.util.ResumeTokenInterpreter getInitialResumeToken ENTRY wikis
[11/18/16 18:46:00:580 GMT] 000001ba ResumeTokenIn > com.ibm.connections.search.seedlist.crawler.util.ResumeTokenInterpreter resumeTokenFromDate ENTRY Thu Jan 01 01:00:00 GMT 1970 wikis
[11/18/16 18:46:00:580 GMT] 000001ba ResumeTokenIn < com.ibm.connections.search.seedlist.crawler.util.ResumeTokenInterpreter resumeTokenFromDate RETURN AAAAAAAAAAA=
[11/18/16 18:46:00:580 GMT] 000001ba ResumeTokenIn < com.ibm.connections.search.seedlist.crawler.util.ResumeTokenInterpreter getInitialResumeToken RETURN AAAAAAAAAAA=

Resume tokens and references to duplicate data in the database, hmmm. Well HOMEPAGE has the SR_RESUME_TOKENS table. I opened it in dbVisualizer and saw this.

resumetoken2

It didn’t look right and compared it with other deployments and found that others only have the one row per application. The knowledge center details how to manipulate them but not clear them.

I shut down all application servers and backed up HOMEPAGE database. I then cleared the table

# su – db2inst1
$ cd /opt2/db2backups/55_homepage_resumetokens/homepage/
$ db2 backup db homepage to ‘/opt2/db2backups/55_homepage_resumetokens/homepage/’
$ db2 connect to homepage
$ db2 “DELETE FROM HOMEPAGE.SR_RESUME_TOKENS WHERE NODE_ID = ‘*****Node01:InfraCluster_server1′”
$ db2 connect reset

On startup the errors have gone and there is only one row per application.

Active users showing as inactive in All Connections search

A customer was seeing some users marked as inactive when using the All Connections search but when clicking through to the user’s profile they were active and active in communities and all over areas of Connections.

Looking into the database tables I found that the “state” of these users were correct, for example, in the EMPINST.GIVEN_NAME a particular user had a PROF_USRSTATE equalling 0 which means he’s active. In the EMPINST.EMPLOYEE table affected users had their email addresses which are normally removed when they are made inactive.

After some investigation I found that by simply activating them would mark them as active without any changes to the various tables in PEOPLEDB.

This got me thinking that the problem was an index issue and without knowing how many people were affected I suggested that the customer recreate the index. I provided them with steps of how to back it up, delete it from the file system and create a new one but even after the index created users were still showing as inactive.

Thankfully I had access to the Control Center and decided to look at all the PEOPLEDB tables, none were useful. I then started looking at the next logical database, HOMEPAGE. Interestingly, in the HOMEPAGE.PERSON table there is a column called STATE and the affected users had a value of 1 in that column. Running the following command changed the STATE to 0 and then searching for the user using the All Connections search showed him as active.

wsadmin.bat -lang jython -port 8879
execfile(“D:\IBM\WebSphere\AppServer\profiles\AppSrv01\config\bin_lc_admin\profilesAdmin.py”)
ProfilesService.activateUserByUserId(“E4BB9E9D-43D3-B5A4-8025-7433003EFACB”,email=”ben.williams@acme.com”, displayName=”Ben Williams”)

Going further I had to identify how many users were affected and the below query gave me the column values I needed to activate users who were marked inactive.

SELECT PERSON.DISPLAYNAME, PERSON.EXID, PERSON.USER_MAIL_LOWER FROM HOMEPAGE.PERSON AS PERSON WHERE PERSON.USER_MAIL_LOWER IS  NOT  NULL  AND PERSON.STATE = 1

The above query helped but there were still a number of users that were not in HOMEPAGE.PERSON and are in PEOPLEDB. These people were showing as inactive in the All Connections search BUT had never logged into Connections and hence their email addresses had not populated the HOMEPAGE database. These I had cross referenced manually as I don’t have the know how to build a query over different databases 😦

There is a bit of history here. The customer is importing users manually via populate_from dn_file because they want to control who is being added until their Connections 4 environment has been signed off for production and a custom TDI assembly line has been created. A few months ago sync_all_dns was run accidentally which meant that a 1000 or so users had to be identified and then removed from Connections. I believe that this (in some) way caused these problems.

More NFS file locking and Connections

So I posted a blog NFS file locking and it’s affect on Connections not long ago so when I came across a similar problem with my lab deployment of Connections 3.0.1 on CentOS I thought it would be easy to sort out…..

I was seeing the following exception in the systemOut.log:

[7/9/12 22:27:51:172 BST] 00000022 SystemOut     O java.lang.reflect.InvocationTargetException
[7/9/12 22:27:51:173 BST] 00000022 SystemOut     O      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[7/9/12 22:27:51:173 BST] 00000022 SystemOut     O      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
[7/9/12 22:27:51:174 BST] 00000022 SystemOut     O      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
[7/9/12 22:27:51:174 BST] 00000022 SystemOut     O      at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
[7/9/12 22:27:51:175 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.utils.Utils.getImpl(Utils.java:44)
[7/9/12 22:27:51:175 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.utils.FileLock.getFileLock(FileLock.java:41)
[7/9/12 22:27:51:176 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.ObjectManagerState.<init>(ObjectManagerState.java:408)
[7/9/12 22:27:51:176 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.ObjectManager.createObjectManagerState(ObjectManager.java:293)
[7/9/12 22:27:51:176 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.ObjectManager.initialise(ObjectManager.java:237)
[7/9/12 22:27:51:177 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.ObjectManager.<init>(ObjectManager.java:197)
[7/9/12 22:27:51:177 BST] 00000022 SystemOut     O      at com.ibm.ws.sib.msgstore.persistence.objectManager.PersistentMessageStoreImpl.start(PersistentMessageStoreImpl.java:354)
[7/9/12 22:27:51:178 BST] 00000022 SystemOut     O      at com.ibm.ws.sib.msgstore.impl.MessageStoreImpl.start(MessageStoreImpl.java:1518)
[7/9/12 22:27:51:178 BST] 00000022 SystemOut     O      at com.ibm.ws.sib.admin.impl.JsMessagingEngineImpl.start(JsMessagingEngineImpl.java:609)
[7/9/12 22:27:51:178 BST] 00000022 SystemOut     O      at com.ibm.ws.sib.admin.impl.HAManagerMessagingEngineImpl.activate(HAManagerMessagingEngineImpl.java:995)
[7/9/12 22:27:51:179 BST] 00000022 SystemOut     O      at com.ibm.ws.sib.admin.impl.JsActivationThread.run(JsActivationThread.java:92)
[7/9/12 22:27:51:179 BST] 00000022 SystemOut     O Caused by: java.io.IOException: No locks available
[7/9/12 22:27:51:180 BST] 00000022 SystemOut     O      at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:914)
[7/9/12 22:27:51:180 BST] 00000022 SystemOut     O      at java.nio.channels.FileChannel.tryLock(FileChannel.java:973)
[7/9/12 22:27:51:181 BST] 00000022 SystemOut     O      at com.ibm.ws.objectManager.utils.FileLockImpl.<init>(FileLockImpl.java:58)
[7/9/12 22:27:51:181 BST] 00000022 SystemOut     O      … 15 more

I tried the steps in my previous blog to no avail. I also edited /etc/fstab on the client appending “lock” as follows.

xxx.xxx.x.xx:/opt/IBM/LotusConnections/shared /opt/IBM/LotusConnections/data/shared nfs rw,lock,hard,intr 0 0

I re-read some CentOS documentation picking up on the following:

“nfslock also has to be started for both the NFS client and server to function properly. To start NFS locking as root type: /sbin/service nfslock start. If NFS is set to start at boot, please ensure that nfslock also starts by running chkconfig –list nfslock. If nfslock is not set to on, this implies that you will need to manually run the /sbin/service nfslock start each time the computer starts. To set nfslock to automatically start on boot, type the following command in a terminal chkconfig nfslock on. ”

After enabling nfslock on the two clients and the server as well as setting it to start at boot I was up and running again. Odd thing was that this was running fine up until a prolonged shut down of my servers.

Missing status updates in Connections

A customer had a particularly difficult and awkward problem. A number of user reported that they could not see the status updates of others users who were part of the same network.

This seemed to affect a handful of users in the following pattern. User A cannot see User B’s status updates whilst User C can see User B’s updates. The problem is associated with User B. To c0mplicate things further User A shares a network with User B but does not follow him whilst User C also shares a network but does follow User B. So if you follow User B you get to see their status updates in your Status Updates. You should not need to follow the user to see their status updates the fact that you’re sharing a network should be sufficient.

I logged a PMR which thankfully found it’s way into the ever helpful hands of David McCarthy. An ISSC colleague mentioned that this had been seen in IBM’s deployment of Connections (w3) by Luiz Benietz and others so getting to the bottom of this would kill two birds with one stone.

IBM had me extract data from PEOPLEDB and HOMEPAGE databases using the DB2 commands below to provide the PERSON_ID of the users within HOMEPAGE to make sure it matches with the PROF_GUID within PEOPLEDB which for User B is E833339D-AED2-425F-8600-64CEFD85A3A5.

Comparison of NR_NETWORK showed that the data in there was correct.

PEOPLEDB.EMPLOYEE
PROF_GUID = E833339D-AED2-425F-8600-64CEFD85A3A5

HOMEPAGE.PERSON
PERSON_ID = 0bdd7880-8d9a-4fe1-ab13-77c54bb429bc
EXID = E833339D-AED2-425F-8600-64CEFD85A3A5

Using the PERSON_ID I was then able to retrieve data from the HOMEPAGE.NR_NEWS_STATUS_NETWORK table which details all the status updates posted by that PERSON_ID which in the HOMEPAGE.NR_NEWS_STATUS_NETWORK table is identified by ACTOR_UUID. This table also shows the READER_ID (detailed below) as well as a BRIEF_DESC column which shows the first 200 (or so) characters of the status update.

db2 “select * from HOMEPAGE.NR_NEWS_STATUS_NETWORK where ACTOR_UUID = ‘0bdd7880-8d9a-4fe1-ab13-77c54bb429bc‘”

HOMEPAGE.NR_NEWS_STATUS_NETWORK
ACTOR_UUID = 0bdd7880-8d9a-4fe1-ab13-77c54bb429bc
READER_ID = PERSON_ID of the person that is consuming the status updates ie those listed in the READER_ID column should be able to see the status updates.

To find out the names of the people who can read the updates you will need to use the READER_ID from HOMEPAGE.NR_NEWS_STATUS_NETWORK and find out the name of the user from the HOMEPAGE.PERSON by using the READER_ID converted to PERSON_ID.

db2 “select DISPLAYNAME from HOMEPAGE.PERSON where PERSON_ID = ‘714fbcd9-1740-477b-972d-a6456a35e4ea'”

I contacted the user and they could indeed see User B’s status updates.

To see how many people User B has in their network the following command was run which produced a value of about 140.

db2 “select count(*) from HOMEPAGE.NR_NETWORK where PERSON_ID = ‘0bdd7880-8d9a-4fe1-ab13-77c54bb429bc‘”

IBM asked for the following trace to be enabled whilst User B posted a status update and in the trace.log an exception was seen during the processing of one of the threads.

*=info: java.sql.*=all: com.ibatis.sqlmap.*=all: com.ibm.lconn.homepage.*=all: com.ibm.lconn.news.*=all

It was suggested that 3.0.1.1 be installed but that would not be possible. The following fixes were applied which aimed to improve that code path to handle the case where the use records in the Homepage database are out of sync.

3.0.1.0     LO63965     VBUT8LNH89     Enhancement to News aggregation service for better handling of exceptional cases where user records become inconsistent with Profiles

3.0.1.0     LO66468     SCRD8QCDDK     Profiles Sync issue. Too many requests made on Event consumption to Profiles colleague feed

Unfortunately the problem was still apparent after the iFixes so more debugging was enabled, this time com.ibm.lconn.news.*=all: com.ibm.lconn.events.*=all

Thankfully this time the trace.log provided much more valuable information.

[30/05/12 16:14:59:301 BST] 00000067 StatusUpdateS 3 com.ibm.lconn.news.service.impl.StatusUpdateService insertStatusUpdate Going to insert into db with a batch process this number of records: 144 and BUCKET_SIZE_BATCH_INSERT: 100
[ … snip …]
[30/05/12 16:14:59:308 BST] 00000067 NewsStatusNet 3 com.ibm.lconn.news.data.dao.impl.ibatis.NewsStatusNetworkDao insertBatch – doInSqlMapClient rowsaffected 100
[30/05/12 16:14:59:308 BST] 00000067 NewsStatusNet < com.ibm.lconn.news.data.dao.impl.ibatis.NewsStatusNetworkDao insertBatch RETURN 100
[30/05/12 16:14:59:312 BST] 00000067 StatusUpdateS < com.ibm.lconn.news.service.impl.StatusUpdateService insertBatch RETURN 100
[ Here – it’s missing the insert for the remaining 44 records ]
[30/05/12 16:14:59:312 BST] 00000067 StatusUpdateS 3 com.ibm.lconn.news.service.impl.StatusUpdateService insertStatusUpdate Going to insert into db with a batch process this number of records: 3 and BUCKET_SIZE_BATCH_INSERT: 100″

The response from L3 was:

The root cause is a coding error in a routine performing batch insert. The issue should be reproducible for people in the following conditions:

  • The number of people in the intersection of the sets (colleagues, followers)  for the author of the status update is > 100
  • The number of colleagues – number of people in the intersection is > 100 (this is what happen for this very specific PMR)
  • The number of followers – number of people in the intersection is > 100.

So the problem we are seeing is because the user had more than 100 people in their network and/or followers and the code would not add the 44 others to the database. pressing further I asked for an iFix for this which is too complicated and the only option is to go to 3.0.1.1 CR1.

“Upon further review – the issue appears to be indirectly fixed in 3.0.1.1….We performed a code review in 3.0.1.1 – this code path was entirely re-written. The final section of the batch is inserted in the new code in 3.0.1.1.”

Up until this point the problem still hasn’t been resolved as updating Connections has not been planned yet but IBM support are fairly confident that 3.0.1.1 CR1 will iron out this. With regards to w3, David McCarthy was able to track down one user who reported the problem initially who now (after the upgrade) can see all status updates.

NFS file locking and it’s affect on Connections

A customer moved the shared storage (NAS) that Connections uses to new hardware. Check out by the testers showed that Connections was functioning correctly but by Monday morning it was apparent there were problems.

Looking at the various SystemOut.logs on monday morning I saw errors such as the following on both the Connections nodes.

[16/06/12 18:25:33:585 BST] 0000004c LocalExceptio E   CNTR0020E: EJB threw an unexpected (non-dered) exception during invocation of method “publishEvent” on bean “BeanId(Files#lc.events.publisher#EventPublisher, null)”. Exception data: com.ibm.lconn.events.internal.invoke.async.AsyncEventInerException: CLFWY0148E: Could not obtain an event connection
at com.ibm.lconn.events.internal.publish.impl.JMSPublisher.publishEvent(JMSPublisher.java9)

Caused by: javax.jms.JMSException: CWSIA0241E: An exception was received during the call to the method JmsManagedConnectionFactoryImpl.createConnection: com.ibm.websphere.sib.exception.SIResourceException: CWSIT0088E: There are currently no messaging engines in bus ConnectionsBus running. Additional failure information: CWSIT0103E: No messaging engine was found that matched the following parameters: bus=ConnectionsBus, targetGroup=null, targetType=BusMember, targetSignificance=Preferred, transpo
rtChain=InboundSecureMessaging, proximity=Bus..
at com.ibm.ws.sib.api.jms.impl.JmsManagedConnectionFactoryImpl.createConnection(JmsManagedConnectionFactoryImpl.java:195)
Caused by: com.ibm.websphere.sib.exception.SIResourceException: CWSIT0088E: There are currently no messaging engines in bus ConnectionsBus running. Additional failure information: CWSIT0103E: No messaging engine was found that matched the following parameters: bus=ConnectionsBus, targetGroup=null, targetType=BusMember, targetSignificance=Preferred, transportChain=InboundSecureMessaging, proximity=Bus.
at com.ibm.ws.sib.trm.client.TrmSICoreConnectionFactoryImpl.localBootstrap(TrmSICoreConnectionFactoryImpl.java:363)

Looking at the SystemOut.log after start up I found that the SibMessages stopped whilst trying to initialise the Log file. The desired output is further down.

Current:
[16/06/12 17:06:21:001 BST] 00000015 SibMessage    I   [ConnectionsBus:dogearcluster.000-ConnectionsBus] CWSIS1581I: The file store is attempting to initalise its log file: /opt/IBM/LotusConnections/data/shared/messageStores/dogearcluster/log/Log

Desired output:
[16/06/12 08:07:48:299 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSID0016I: Messaging engine filescluster.000-ConnectionsBus is in state Starting.
[16/06/12 08:07:48:435 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1581I: The file store is attempting to initalise its log file: /opt/IBM/LotusConnections/data/shared/messageStores/filescluster/log/Log
[16/06/12 08:07:48:552 BST] 000000a1 ObjectManager I   CWSOM0002I: ObjectManager using logFile /opt/IBM/LotusConnections/data/shared/messageStores/filescluster/log/Log was warm started logFileType=FILE.
[16/06/12 08:07:48:676 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1584I: The file store has initialised successfully.
[16/06/12 08:07:48:676 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1585I: The file store is attempting to start its permanent store (PermanentStore) and temporary store (TemporaryStore).
[16/06/12 08:07:48:677 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1586I: The file stores permanent store has been started successfully.
[16/06/12 08:07:48:677 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1587I: The file stores temporary store has been started successfully.
[16/06/12 08:07:48:678 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSIS1559I: The data in the permanent store file occupies 1,434,580 bytes.
[16/06/12 08:07:50:362 BST] 000000a1 SibMessage    I   [ConnectionsBus:filescluster.000-ConnectionsBus] CWSID0016I: Messaging engine filescluster.000-ConnectionsBus is in state Started.

That evening the contents of the /messageStores/* were deleted as detailed in High number of publication points observed for Connections Files – no notifications. Profiles was started first to check that the new Log file was created successfully but the following errors were seen.

[18/06/12 23:15:02:094 BST] 00000012 SystemOut O at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.utils.Utils.getImpl(Utils.java:44)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.utils.FileLock.getFileLock(FileLock.java:41)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManagerState.<init>(ObjectManagerState.java:408)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManager.createObjectManagerState(ObjectManager.java:293)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManager.initialise(ObjectManager.java:237)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.ObjectManager.<init>(ObjectManager.java:197)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.msgstore.persistence.objectManager.PersistentMessageStoreImpl.start(PersistentMessageStoreImpl.java:354)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.msgstore.impl.MessageStoreImpl.start(MessageStoreImpl.java:1518)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.admin.impl.JsMessagingEngineImpl.start(JsMessagingEngineImpl.java:609)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.admin.impl.HAManagerMessagingEngineImpl.activate(HAManagerMessagingEngineImpl.java:995)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at com.ibm.ws.sib.admin.impl.JsActivationThread.run(JsActivationThread.java:92)
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O Caused by: java.io.IOException: No locks available
[18/06/12 23:15:02:095 BST] 00000012 SystemOut O at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:914)
[18/06/12 23:15:02:096 BST] 00000012 SystemOut O at java.nio.channels.FileChannel.tryLock(FileChannel.java:973)
[18/06/12 23:15:02:096 BST] 00000012 SystemOut O at com.ibm.ws.objectManager.utils.FileLockImpl.<init>(FileLockImpl.java:58)

We then tried starting the other node which did start correctly, the Log files were created and the SystemOut.logs looked good. We then started the affected node and everything checked out.

The following day myself and a colleague asked for some data from the Linux servers and noticed in dmesg the following errors.

statd: server localhost not responding, timed out
lockd: cannot monitor 111.125.65.7
lockd: failed to monitor 111.125.65.7
statd: server localhost not responding, timed out
lockd: cannot monitor 111.125.65.7
lockd: failed to monitor 111.125.65.7
statd: server localhost not responding, timed out
lockd: cannot monitor 111.125.65.7
lockd: failed to monitor 111.125.65.7

The services mentioned are related to NFS and file locking which fits with what we are seeing and the change the customer made over the weekend.

The customer’s Unix team found a match on Redhat’s website which describes the condition and they wanted to remount and restart the nfslock service. Unix needed all applications stopped on the affected node that access the share and they ran the following commands.

/etc/init.d $ umount /net/connections
/etc/init.d $ ./nfslock stop
/etc/init.d $ ./nfslock start
/etc/init.d $ mount /net/connections

I started all the application servers on the affected node and then to test whether file locking was working I stopped two of the applications on the other node so that the affected node took over and thankfully the affected node showed healthy messages in the SystemOut.logs.

[19/06/12 15:48:01:243 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1581I: The file store is attempting to initalise its log file: /opt/IBM/LotusConnections/data/shared/messageStores/forumcluster/log/Log
[19/06/12 15:48:01:406 BST] 00000047 ObjectManager I CWSOM0002I: ObjectManager using logFile /opt/IBM/LotusConnections/data/shared/messageStores/forumcluster/log/Log was warm started logFileType=FILE.
[19/06/12 15:48:01:524 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1584I: The file store has initialised successfully.
[19/06/12 15:48:01:524 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1585I: The file store is attempting to start its permanent store (PermanentStore) and temporary store (TemporaryStore).
[19/06/12 15:48:01:525 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1586I: The file stores permanent store has been started successfully.
[19/06/12 15:48:01:525 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1587I: The file stores temporary store has been started successfully.
[19/06/12 15:48:01:526 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1559I: The data in the permanent store file occupies 1,255,938 bytes.
[19/06/12 15:48:01:526 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1551I: The current minimum reserved size of the permanent store file is 209,715,200 bytes. The current maximum size is 524,288,000 bytes.
[19/06/12 15:48:01:527 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1560I: The data in the temporary store file occupies 8,197 bytes.
[19/06/12 15:48:01:527 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1555I: The current minimum reserved size of the temporary store file is 209,715,200 bytes. The current maximum size is 524,288,000 bytes.
[19/06/12 15:48:01:528 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1550I: The current size of the log file is 104,857,600 bytes. The size in the configuration information of the log file is 104,857,600 bytes.
[19/06/12 15:48:01:532 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1564I: The messaging engine, ME_UUID=392DF5A35DD2FFC2, INC_UUID=71C871C80536E9D9, is attempting to obtain an exclusive lock on the file store.
[19/06/12 15:48:01:533 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1566I: A single previous owner was found in the messaging engine’s file store, ME_UUID=392DF5A35DD2FFC2, INC_UUID=1FE31FE301D82BD2
[19/06/12 15:48:01:542 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1563I: The messaging engine, ME_UUID=392DF5A35DD2FFC2, INC_UUID=71C871C80536E9D9, has acquired an exclusive lock on the file store.
[19/06/12 15:48:01:573 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIS1588I: The file store has started successfully.
[19/06/12 15:48:02:356 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0212I: Messaging engine forumcluster.000-ConnectionsBus on bus ConnectionsBus is starting to reconcile the WCCM destination and link configuration.
[19/06/12 15:48:02:375 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0213I: Messaging engine forumcluster.000-ConnectionsBus on bus ConnectionsBus has finished reconciling the WCCM destination and link configuration.
[19/06/12 15:48:02:488 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSJZ0021I: WS-ReliableMessaging is enabled for messaging engine forumcluster.000-ConnectionsBus on bus ConnectionsBus.
[19/06/12 15:48:02:542 BST] 00000047 SibMessage W [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0028W: Support for multiple subscribers to subscription commands##forums, on destination connections.platformCommands, has been disabled.
[19/06/12 15:48:02:549 BST] 00000047 SibMessage W [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIV1101W: The MDB maximum concurrency configuration value has been changed from the value 1 to 1.
[19/06/12 15:48:02:554 BST] 00000047 SibMessage W [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIP0181W: The maximum batch size for the consumer on destination connections.platformCommands, has been set to 1.
[19/06/12 15:48:02:556 BST] 00000047 SibTr$Suppres I CWSIU0006I: Future occurences of “CWSIP0181W: The maximum batch size for the consumer on destination connections.platformCommands, has been set to 1.” will be suppressed.
[19/06/12 15:48:02:557 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSIV0777I: A connection to messaging engine forumcluster.000-ConnectionsBus for destination connections.platformCommands on bus ConnectionsBus has been successfully created.
[19/06/12 15:48:02:558 BST] 00000047 SibMessage I [ConnectionsBus:forumcluster.000-ConnectionsBus] CWSID0016I: Messaging engine forumcluster.000-ConnectionsBus is in state Started.

I found the following statement in the Connections wiki, the customer is using NFS v3.

“In a multi-node cluster, you need to configure network share directories for content stores and search indexes. When using NFS, use NFS v4 because NFS v3 lacks advanced locking capability.”

Dave Hay came up against a similar problem in his post “Lotus Connections 2.5.0.1 – Problems with Service Integration Bus after configuring Shared Message Store” which is worth a read too.