The customer told me that his search index never completed correctly when Connections was initially deployed and now users are complaining that search results do not contain CCM documents.
The customer had tried recreating the index but to no avail and called me to take a look.
I first enabled trace on one of the infrastructure nodes (*=info: com.ibm.connections.search.index.indexing.*=all: com.ibm.connections.search.seedlist.*=all: com.ibm.connections.httpClient.*=all: com.ibm.connections.search.index.indexing.EcmFilesIndexer=all) as detailed in http://www-01.ibm.com/support/docview.wss?uid=swg21636559
I then created a back ground index as detailed in, Creating a back ground index and tailed the trace.log and SystemOut.log. To create the background index I ran the following commands on the Windows server.
wsadmin.bat -lang jython -username wasadmin -password ********
SearchService.startBackgroundIndex(“c:/IBM/Connections/background/crawl”, “c:/IBM/Connections/background/extracted”, “c:/IBM/Connections/background/index”, “ecm_files”)
I found that the indexing process finished abruptly about 3500 documents in (with another 6500 odd remaining).
[10/09/14 09:15:59:293 BST] 0000007a SeedlistPagin < com.ibm.connections.search.seedlist.parser.impl.SeedlistPaginationHandler resolve RETURN https://connections.acme.com/dm/atom/library/8DB6D184-AAF5-41F3-A28D-D1B7BEF17967%3BC11D230C-66A5-4CEB-8906-EAB19DFE0B8D/document/%7B5DEBC165-CDF6-4672-8300-A3345507867F%7D/media/%33%35%20%28%32%30%31%34%29%20%34%33%2d%38%35%20%54%68%65%20%53%79%73%74%65%6d%73%20%54%61%6e%74%6164%66?follow=true
[10/09/14 09:15:59:293 BST] 0000007a SystemErr R [Fatal Error] :23466:346: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
[10/09/14 09:15:59:293 BST] 0000007a SeedlistEntry 2 com.ibm.connections.search.seedlist.crawler.impl.SeedlistEntryIterator hasNext CLFRW0063E: SAX parser error.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
I took the URL (which has been edited) and logged in using an administrative account and was provided with a pdf. I initially believed that it must have been the contents of the document that caused the problem so I uploaded the same document to a 4.5 CR4 server I run in the lab and couldn’t reproduce the problem.
I raised a PMR and they came back and said that problem is likely to be due a special character in the description and not in the document itself.
I looked at the trace.log and found reference to the seedlist xml that was being processed at the time.
[10/09/14 09:52:26:121 BST] 0000007a SeedlistPersi > com.ibm.connections.search.seedlist.crawler.impl.SeedlistPersistenceManager getSeedlistDirs ENTRY ecm_files
[10/09/14 09:52:26:121 BST] 0000007a SeedlistPersi < com.ibm.connections.search.seedlist.crawler.impl.SeedlistPersistenceManager getSeedlistDirs RETURN ecm_files, [c:\IBM\Connections\background\crawl\seedlists-ecm_files-initial-1410267828454]
[10/09/14 09:52:26:121 BST] 0000007a SeedlistPersi < com.ibm.connections.search.seedlist.crawler.impl.SeedlistPersistenceManager getSeedlistDir RETURN c:\IBM\Connections\background\crawl\seedlists-ecm_files-initial-1410267828454
[10/09/14 09:52:26:121 BST] 0000007a SeedlistFetch 3 seedlistFile = [c:\IBM\Connections\background\crawl\seedlists-ecm_files-initial-1410267828454\1410267828454-00007.xml]
[10/09/14 09:52:26:121 BST] 0000007a SeedlistFetch 2 Retrieving seedlist content: https://connections.acme.com/dm/atom/seedlist/myserver?useLocalFS=true&Start=3500&Action=GetDocuments&Format=xml&Range=500
[10/09/14 09:52:26:121 BST] 0000007a SeedlistFetch 3 Retrieving seedlist from file: 1410267828454-00007.xml
I opened the xml in Notepad++ and searched for the document name which I obtained from the URL previously and found a match. In one of the fields I see the following.
I provided the community and library that the document resided in and the customer couldn’t view the description data in the web browser. The customer made some changes to the field via the FileNet interface and once the special character was removed the data showed in the web browser.
To check whether the index is created correctly after this change I ran the background index again but wrote the files to a new location. If you run the command again to the same location as the initial background index then it will fail because the seedlist will not have been recreated and the original special character is retained.
To speed things up, copy the extracted files from the previ0us location to the new extracted files. This customer had over ten thousand CCM documents so extracting them all again was time consuming.
I had to iterate this process four times until all the special characters were removed. Once you have an INDEX.READY file then I repeated the process for all the applications by copying over the extracted files and using SearchService.startBackgroundIndex(“c:/IBM/Connections/background/crawl”, “c:/IBM/Connections/background/extracted”, “c:/IBM/Connections/background/index”, “all_configured”) which built an index successfully.
I then used the steps in the IBM wiki to replace the current with the new index.
It turns out that the customer used a scripted import facility to import all the documents into CCM and this process introduced these characters.