Orient Me and mongoDB connection failures

I have been banging against a mongoDB wall for a good few days as explained in another post but I’m slowly getting there. The problem I was facing was that the migration application in the people-migrate container wasn’t working.

# npm run start migrate
npm info it worked if it ends with ok
npm info using npm@3.10.8
npm info using node@v6.9.1
npm info lifecycle people-datamigration-service@0.0.1~prestart: people-datamigration-service@0.0.1
npm info lifecycle people-datamigration-service@0.0.1~start: people-datamigration-service@0.0.1

> people-datamigration-service@0.0.1 start /usr/src/app
> cross-env NODE_ENV=production node lib/server.js “migrate”

2017-04-20T13:19:56.761Z – info: [migrator] Mongo DB URL: mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017/relationshipdb?replicaSet=rs0&readPreference=primaryPreferred&wtimeoutMS=2000
2017-04-20T13:19:56.766Z – info: [migrator] Mongo DB URL: mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017/datamigrationdb?replicaSet=rs0&readPreference=primaryPreferred&wtimeoutMS=2000
2017-04-20T13:19:56.767Z – info: [migrator] Mongo DB URL: mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017/profiledb?replicaSet=rs0&readPreference=primaryPreferred&wtimeoutMS=2000
Connection fails: MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: getaddrinfo ENOTFOUND mongo-0 mongo-0:27017]
It will be retried for the next request.

/usr/src/app/node_modules/mongodb/lib/mongo_client.js:338
          throw err
          ^
MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: getaddrinfo ENOTFOUND mongo-0 mongo-0:27017]
    at Pool.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/topologies/server.js:327:35)
    at emitOne (events.js:96:13)
    at Pool.emit (events.js:188:7)
    at Connection.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/connection/pool.js:274:12)
    at Connection.g (events.js:291:16)
    at emitTwo (events.js:106:13)
    at Connection.emit (events.js:191:7)
    at Socket.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/connection/connection.js:177:49)
    at Socket.g (events.js:291:16)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at connectErrorNT (net.js:1020:8)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

If I specify the location of migrationConfig I get the same result.

# npm run start migrate config:/usr/src/app/migrationConfig

Oddly enough, if I run the above command outside of /usr/src/app/ directory it fails. It doesn’t actually read the file you specify, it always looks for migrationConfig in relation to the working directory where you are when you issue it. Of course I may have the syntax wrong but if I do not then it’s a bit sloppy.

On to the problem which seems to be name resolution. The error I was getting was

Connection fails: MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: getaddrinfo ENOTFOUND mongo-0 mongo-0:27017]

It seems to be trying to connect to mongo-0 over 27017.

# kubectl exec -it $(kubectl get pods | grep people-migrate | awk ‘{print $1}’) bash

# ping mongo-0
ping: mongo-0: Name or service not known

# ping mongo
PING mongo.default.svc.cluster.local (10.1.67.163) 56(84) bytes of data.
64 bytes from 10.1.67.163 (10.1.67.163): icmp_seq=1 ttl=63 time=0.063 ms

# ping mongo-0.mongo
PING mongo-0.mongo.default.svc.cluster.local (10.1.67.163) 56(84) bytes of data.
64 bytes from 10.1.67.163 (10.1.67.163): icmp_seq=1 ttl=63 time=0.087 ms

This was the cause, “mongo-0” was not resolving for me and this is confirmed by another that there container works the same. To work around this I added an entry to the container’s host file.

# cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.1.67.176     people-migrate-4029352936-n8fzl
10.1.67.163     mongo-0 mongo-0.mongo

Now the migration app works but I also have mongo-sidecar errors which I’m not clear on as to whether they are supposed to be there.

Update – 27/04/17

This only gets me so far. This allows me to get the data migrated from Connections Profiles in to MongoDB but when the container is torn down and replaced with another the host file entry is gone. Also, there are the following errors in the logs for itm-services containers that I cannot exec to to update the hosts file.

Connection fails: MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: getaddrinfo ENOTFOUND mongo-0 mongo-0:27017]
It will be retried for the next request.

/usr/src/app/node_modules/mongodb/lib/mongo_client.js:338
          throw err
          ^
MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: getaddrinfo ENOTFOUND mongo-0 mongo-0:27017]
    at Pool.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/topologies/server.js:327:35)
    at emitOne (events.js:96:13)
    at Pool.emit (events.js:188:7)
    at Connection.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/connection/pool.js:274:12)
    at Connection.g (events.js:291:16)
    at emitTwo (events.js:106:13)
    at Connection.emit (events.js:191:7)
    at Socket.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/connection/connection.js:177:49)
    at Socket.g (events.js:291:16)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at connectErrorNT (net.js:1020:8)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

Update 28/04/17

During the (excellent) Connections Pink Developer Workshop hosted by IBM we were given access to a SoftLayer server running CentOS 7.3 where we installed CfC and Orient Me. The installer worked just fine with no signs of the mongoDB errors above. I have come across two others who have the same errors I have documented above.

I sparked up a CentOS 7.3 server on Bluemix for a few hours and the install with the same binaries worked just fine. I compared what yum has installed and installed all on my local CentOS 7.3 server and the same problem occurred. I changed my NIC device name swapping it from ens192 to match Bluemix and eth0 but the result is the same.

Update 05/05/17

This week I was lucky to visit the Dublin labs with a customer discussing Watson Workspace, Watson Work Services, XPages and Pink. I used a couple of hours of those two days to have a chat with David McDonagh and a colleague of his Bruno to look into the problems I was having with Mongo.

The crux of it was that the node I was using as the master, boot, worker and proxy was under a great deal of strain, mainly CPU strain, which seemed to be causing the problem. This would make sense since the differences between my ESXi server and Bluemix are the resources available to it.

I bumped up the resources available to the single node but although the install went OK the problems persisted. It wasn’t until today that I got it working but not with a single node but rather two nodes. Node 1 ran boot, master and proxy roles whilst node 2 was the worker node. I gave a generous helping of resources to both and the thankfully the installation went smoothly and more importantly the errors above are no more.

I have some further work to see how much I can scale the resources back because it does have an impact on my ESXi host and the other guests on it.

Orient Me and some things I’ve come across and wrestled with

Having gained some experience of Docker and CfC (IBM Spectrum Conductor for Containers) before Connections 6.0 was released I thought this would be easy to set up but I must admit I’m struggling.

My setup is 3 CentOS servers for Orient Me with another for DB2/SDI and another for Connections hosting the deployment manager.

Here are some things I have come across which I’d like to add to as I come across other problems.

DNS

Working on a beefy ESXi server running at home I normally manage most things using hosts file which has worked really well, up until now. I won’t steal from Roberto Boccadoro’s blog post but suffice to say I couldn’t get it to work using hosts file even after editing nsswitch.conf. I had to rely on spoofing DNS, internally, on my router by updating /jffs/configs/hosts.add to include all my Connections servers.

Even with this I found that the migration script in people-migrate container would fail because so in this case I had to add my host files to /etc/hosts which got me past that step.

MongoDB

I had to uninstall and reinstall a couple of times. On reinstall I had problems with the migration application (people-migrate) connecting to mongoDB. I was able to check the databases and connect to them.

# kubectl exec -it mongo-0 bash

#mongo

rs0:PRIMARY> show dbs
admin  0.000GB
local  0.000GB

The migration script was failing to connect and I couldn’t fathom why. I uninstalled again and this time I removed the persistent volumes and recreated them and now the migration script gets further but fails with the following exception.

2017-04-12T12:01:42.751Z – info: [migrator] Mongo DB URL: mongodb://mongo-0.mongo:27017/relationshipdb?replicaSet=rs&readPreference=primaryPreferred&wtimeoutMS=2000
2017-04-12T12:01:42.757Z – info: [migrator] Mongo DB URL: mongodb://mongo-0.mongo:27017/datamigrationdb?replicaSet=rs&readPreference=primaryPreferred&wtimeoutMS=2000
2017-04-12T12:01:42.758Z – info: [migrator] Mongo DB URL: mongodb://mongo-0.mongo:27017/profiledb?replicaSet=rs&readPreference=primaryPreferred&wtimeoutMS=2000
2017-04-12T12:01:54.018Z – info: [migrator] total request number: 1
2017-04-12T12:01:54.021Z – info: [populator] Start to populate URL:
–“https://connections.domain.com/profiles/admin/atom/profiles.do?ps=100&#8221;

2017-04-12T12:01:59.417Z – error: [migrator] errors:[{“profileKey”:”16ff2775-2ace-4db8-8e54-56adcc62a5fb”,”externalId”:”382AB352-F9AE-D6E4-8025-7D2C004A7248″,”created”:1491998514408,”orgId”:”a”,”id”:”FAKE_ID”,”error”:{}},{“profileKey”:”8af449b4-0357-4bed-a7c7-c0e5285ba826″,”externalId”:”932ED7B3-988D-9EFC-8625-79E3005B2B62″,”created”:1491998514409,”orgId”:”a”,”id”:”FAKE_ID”,”error”:{}},{“profileKey”:”a9294f18-ee72-49d0-8a44-cf02abe6d4d2″,”externalId”:”0873E9A9-7E12-0609-8025-7D38003BFD71″,”created”:1491998514410,”orgId”:”a”,”id”:”FAKE_ID”,”error”:{}},{“profileKey”:”b6994f86-7525-48b6-92da-900393382e11″,”externalId”:”0F64A6F8-927B-483C-8625-79E3005AC781″,”created”:1491998514410,”orgId”:”a”,”id”:”FAKE_ID”,”error”:{}}]
Connection fails: MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: connection 4 to mongo-0:27017 timed out]
It will be retried for the next request.
Connection fails: MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: connection 5 to mongo-0:27017 timed out]
It will be retried for the next request.

/usr/src/app/node_modules/mongodb/lib/mongo_client.js:338
throw err
^
MongoError: failed to connect to server [mongo-0:27017] on first connect [MongoError: connection 5 to mongo-0:27017 timed out]
at Pool.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/topologies/server.js:327:35)
at emitOne (events.js:96:13)
at Pool.emit (events.js:188:7)
at Connection.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/connection/pool.js:274:12)
at Connection.g (events.js:291:16)
at emitTwo (events.js:106:13)
at Connection.emit (events.js:191:7)
at Socket.<anonymous> (/usr/src/app/node_modules/mongodb-core/lib/connection/connection.js:187:10)
at Socket.g (events.js:291:16)
at emitNone (events.js:86:13)
at Socket.emit (events.js:185:7)
at Socket._onTimeout (net.js:339:8)
at ontimeout (timers.js:365:14)
at tryOnTimeout (timers.js:237:5)
at Timer.listOnTimeout (timers.js:207:5)

Redis client

In the knowledge center it alludes as to how to test connecting to Redis from the Connections node. If you want to install the client and try for yourself here are the instructions IBM deemed not necessary to write down for you.

# su -c ‘rpm -Uvh http://download.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-9.noarch.rpm&#8217;
# yum install redis

# redis-cli -p 30379
127.0.0.1:30379> set foo bar
OK
127.0.0.1:30379> get foo
“bar”
127.0.0.1:30379>

Odd pod behaviour

I believe I have an underlying problem with the persistent volumes and over night this happened.

# kubectl get pods

zookeeper-controller-3-2528439515-xz702   0/1       OutOfpods   0          13h
zookeeper-controller-3-2528439515-xz79d   0/1       OutOfpods   0          14h
zookeeper-controller-3-2528439515-xzqc9   0/1       OutOfpods   0          13h
zookeeper-controller-3-2528439515-xzzbl   0/1       OutOfpods   0          16h
zookeeper-controller-3-2528439515-z0kwf   0/1       OutOfpods   0          13h
zookeeper-controller-3-2528439515-z13kn   0/1       OutOfpods   0          17h
zookeeper-controller-3-2528439515-z2lsn   0/1       OutOfpods   0          13h
zookeeper-controller-3-2528439515-z6mc5   0/1       OutOfpods   0          14h
zookeeper-controller-3-2528439515-z74nj   0/1       OutOfpods   0          13h
zookeeper-controller-3-2528439515-z97jp   0/1       OutOfpods   0          17h
zookeeper-controller-3-2528439515-zd2js   0/1       OutOfpods   0          4h
zookeeper-controller-3-2528439515-zdc3t   0/1       OutOfpods   0          14h
zookeeper-controller-3-2528439515-zk5bw   0/1       OutOfpods   0          16h

# kubectl get pods | wc -l
2114

There were thousands of pods. I believe they were created faster than they could be garbage collected.

I deleted all the pods in the “OutOfpods” status using the following command.

# kubectl get pod | cut -d ” ” -f 1 | xargs -n1 -P 10 kubectl delete pod

Shutdown

To shutdown my servers I have been running the following to stop all pods.

# docker stop $(docker ps -a -q)

I’m not sure whether I am better off using a different variation of above to stop all pods

# kubectl get pod | cut -d ” ” -f 1 | xargs -n1 -P 10 kubectl delete pod

Is there a better prescribed way of doing this?

Enabling profiles events for Orient Me

I did what was asked of me in the knowledge center but there is little indication of it having worked. In the documentation it states that I should see “OrientMe configured properly – both properties are enabled.” Where should I see that, in SDI’s ibmdi.log or in one of the application servers SystemOut.log? I have looked at both and I do not see this written.

Anyway, I’ll hopefully add  to this as I go. If anyone has come across these problems and found a resolution to them, please get in touch.