CBL iOS pull/push status not showing offline when device has no internet connection but is connected to a wifi network

Caroline · July 25, 2017, 10:48pm

Hi,

I’m having an issue where CBLReplication is not returning the expected .status when a wifi connection exists but the device cannot reach the internet/gateway.

My scenario:
-iPad is joined to a local wifi network with internet connection
–a continuous pull/push is established, switching between active and idle as documents are generated.
-Disconnect the internet from the wifi router, iPad is still joined to the wifi network (now with no way to reach the gateway because it has lost internet)
–The continuous pull/push now remains idle, and is never updated to show offline.

Is this expected behavior? Is there another way I can check the pull/push to verify my replication state is healthy?

I’ll note, if I turn the wifi off on the iPad, the status immediately changes to offline. I assumed it would behave similarly if the internet connection was lost.

Any advice will be appreciated.

jens · July 25, 2017, 11:46pm

Disconnect the internet from the wifi router, iPad is still joined to the wifi network (now with no way to reach the gateway because it has lost internet)
–The continuous pull/push now remains idle, and is never updated to show offline.

I would expect it to go offline eventually, though it might take a few minutes. The iPad has no direct way of knowing you’ve disconnected the router. It can tell if it tries to send data to the server (it won’t get any acknowledgement and TCP will close the socket), and it can indirectly tell because it’s not receiving any more data. Our continuous puller sends and receives a “heartbeat” message once in a while, to help detect losing connectivity.
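As a rough illustration of the heartbeat idea, an app could keep its own record of when it last saw any traffic and treat the link as suspect after too long a silence. This is only a sketch under assumptions: `HeartbeatWatchdog`, `recordActivity`, `isStale`, and the two-missed-beats threshold are hypothetical names and values, not part of Couchbase Lite, and the app would have to decide for itself what counts as "activity".

```swift
import Foundation

/// Tracks when data (including heartbeats) last arrived and flags the
/// connection as suspect once too much time has passed.
/// Hypothetical helper for illustration; not a Couchbase Lite API.
struct HeartbeatWatchdog {
    /// Expected interval between heartbeats, in seconds.
    let heartbeatInterval: TimeInterval
    /// How many intervals may pass silently before the link counts as stale (assumed value).
    let missedBeatsAllowed: Double
    /// Time of the most recent observed activity.
    private(set) var lastActivity: Date

    init(heartbeatInterval: TimeInterval, missedBeatsAllowed: Double = 2, now: Date = Date()) {
        self.heartbeatInterval = heartbeatInterval
        self.missedBeatsAllowed = missedBeatsAllowed
        self.lastActivity = now
    }

    /// Call whenever a heartbeat or any other data is seen.
    mutating func recordActivity(at time: Date = Date()) {
        lastActivity = time
    }

    /// True when the silence has lasted longer than the allowed number of intervals.
    func isStale(at time: Date = Date()) -> Bool {
        return time.timeIntervalSince(lastActivity) > heartbeatInterval * missedBeatsAllowed
    }
}

// Usage with fixed timestamps so the result does not depend on the clock.
let start = Date(timeIntervalSince1970: 0)
var watchdog = HeartbeatWatchdog(heartbeatInterval: 60, now: start)
watchdog.recordActivity(at: start.addingTimeInterval(60))
print(watchdog.isStale(at: start.addingTimeInterval(100)))  // 40 s of silence: false
print(watchdog.isStale(at: start.addingTimeInterval(300)))  // 240 s of silence: true
```

Passing the timestamps in explicitly keeps the logic easy to check without waiting on a real clock.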

I’ll note, if I turn the wifi off on the iPad, the status immediately changes to offline. I assumed it would behave similarly if the internet connection was lost.

No. If you make a change on the iPad itself, apps are notified that the network interface has gone down. If you make a change on some other device like a router, there is no direct way to find out — it’s not like the router is going to broadcast a message saying that it lost its Internet connection. (This is a pretty common misconception.)

Caroline · July 26, 2017, 12:58am

Thanks for the quick reply @jens

In my tests, the replication state has never changed to offline. I’ve waited 60+ minutes.

We actually have the heartbeat value set to 60000, to help resolve these kinds of issues quicker.
However, even after the heartbeat notices the connection is unreachable, the pull status does not change to “offline.”

With logging enabled on iOS, I see this:

2017-07-25 19:08:45.595 ChangeTracker: CBLSocketChangeTracker[0x14f526020 bucket]: Timeout ...
2017-07-25 19:08:45.598 CBLSocketChangeTracker[0x14f526020 bucket]: Connection error #2, retrying in 4.0 sec: NSURLError[-1001, "Timeout", <http://gatewayIP:4984/bucket/_changes?feed=longpoll&heartbeat=60000&style=all_docs&since=32078391::32109377&filter=sync_gateway/bychannel>]
error hitting sync gateway: The request timed out.
error hitting sync gateway: -1001

After this initial error, I repeatedly see this (with longer retry times as it runs):

2017-07-26 00:17:57 +0000: CBLSocketChangeTracker[0x14f526020 bucket]: Connection error #9, retrying in 512.0 sec: NSPOSIXError[51, “Network is unreachable”]

Again, the status never changes and remains idle.

I can also create documents and attempt to push them.
The push status will become active and attempt to push the documents. The push receives an error, but again, the status does not change to “offline.”

Error Domain=NSURLErrorDomain Code=-1004 “Could not connect to the server.” UserInfo={NSUnderlyingError=0x14f2ed1b0 {Error Domain=kCFErrorDomainCFNetwork Code=-1004 “(null)” UserInfo={_kCFStreamErrorCodeKey=51, _kCFStreamErrorDomainKey=1}}, NSErrorFailingURLStringKey=http://gatewayIP:4984/bucket/_revs_diff, NSErrorFailingURLKey=http://gatewayIP:4984/bucket/_revs_diff, _kCFStreamErrorDomainKey=1, _kCFStreamErrorCodeKey=51, NSLocalizedDescription=Could not connect to the server.}

From my observations, the push changes to active, and then returns to idle after it queues the revisions. But it never changes to offline state.

No. If you make a change on the iPad itself, apps are notified that the network interface has gone down. If you make a change on some other device like a router, there is no direct way to find out — it’s not like the router is going to broadcast a message saying that it lost its Internet connection. (This is a pretty common misconception.)

You’re right, when I disable wifi on the iPad, the pull status updates immediately to offline. Further, I can turn off the router, and the replication status changes to offline in a number of seconds.
However, if I disconnect the internet connection from the router, the replication status never changes to offline.

Through the logs, I see that it recognizes it can’t reach the gateway, but the replication status never changes to reflect that.

Any ideas?

Thanks

jens · July 26, 2017, 4:02pm

The Offline status specifically means that the device’s Internet connection is offline, not that it can’t connect to the server. The situation the replicator sees here is identical to if only the server itself went down — all the replicator knows is that it’s trying to open a TCP socket and failing.

I can turn off the router, and the replication status changes to offline in a number of seconds.

Again, this affects the device itself — its WiFi interface loses its connection, so the IP stack notifies apps that the interface is down, triggering the offline state.

TL;DR: To us it seems like unplugging the router and entering Airplane mode are very similar, but at the TCP/IP level they’re not. An IP device has very little insight into the status of other nodes along the routing path.

Caroline · July 26, 2017, 4:37pm

Thanks for the reply @jens ,

I need to be able to recognize when a replicator doesn’t actually have a valid connection.
Is there perhaps another way to do this instead of using the replicator.status?

You say the replicator knows it cannot reach the gateway, but how can I access that information? The pull never receives an error.

The Offline status specifically means that the device’s Internet connection is offline, not that it can’t connect to the server.

That statement makes sense. However, in my case, the iPad only has a wifi network connection, but it cannot reach the internet. So, the iPad internet connection is offline. It seems the device doesn’t necessarily know that it has internet problems. But the TCP connection does, and again, this is never reflected in the pull status.
Leaving the replication state in idle/active isn’t an accurate reflection of the status of the replicator, because the replicators are actually failing to replicate.

Is there another way I should be checking if the replication can’t pull/ push data?

Thanks again

jens · July 26, 2017, 11:18pm

I need to be able to recognize when a replicator doesn’t actually have a valid connection.

Would you mind explaining what you need this for? We’re actually debating something similar in the 2.0 API, i.e. whether to expose a “connecting” state when the replicator is trying to open a TCP connection but doesn’t have one yet.

You say the replicator knows it cannot reach the gateway, but how can I access that information? The pull never receives an error.

It should fail after a few minutes; it will keep retrying, with an exponential backoff interval, but give up after a few tries.

Being unable to connect to a server is a common transient error, so we don’t give up immediately when it happens.

It seems the device doesn’t necessarily know that it has internet problems. But the TCP connection does,

Actually it doesn’t. If a socket is open but idle (no data being sent in either direction), there are no packets being sent, so TCP has no way to know whether the destination is still reachable. I think it will eventually send a packet to check, but that takes at least 90 minutes.

As I said, our pull replicator sends “heartbeat” messages every few minutes, so when one of those fails to be delivered the replicator should discover that the connection is broken. At that point it’ll start trying to reconnect, and eventually give up.

If it doesn’t give up within about 5 minutes, that would be a bug.

Caroline · July 27, 2017, 6:16pm

@jens Thanks for the reply,

Yes, it would be a great help (for debugging) to be able to recognize if replication is not working.

Currently, we need this functionality for everyday use in our application.
We need to prevent our customers from taking certain actions when they don’t have a valid replication connection.

For example, our application allows users to create documents based off of the contents of an existing document. We need to prevent users from doing this when the replication connection isn’t active (so we can avoid users creating entirely separate documents based off of the same original doc and ending up with similar/duplicated docs).

This is also very important for technical support with our customers.
In our application, we have arrows indicating the replication status of the device. If the replication status isn’t valid (idle, active), obviously, the customer can run into unexpected behavior. Further, if a customer does experience issues, it is incredibly helpful to know if their replication status was valid.
It is very common for our customers’ internet to drop many times throughout the day, causing replication to stop until the heartbeat makes replicators re-fire, so being able to recognize these periods without valid replication is important.

Additionally, if we can accurately recognize when gateway replication cannot be reached, we can start syncing data Peer to Peer to ensure replication between devices.

As a developer, I need to know what the status of the replication is to be able to properly debug any issues.

I would say any reason the current “offline” status would be used can also be applied to when the replication cannot reach the gateway to send/receive data. They would be used for the same purpose.

Any time a replicator is not working correctly (not able to pull/push data), we need to be able to recognize that and add proper logic to avoid potential issues in our app.

Let me know if you have any further questions about this.

Alternatively, revealing the status/errors of the TCP socket connection could work. Is that something I can currently access? Ultimately, we need some way to recognize this error in the current 1.4 build. Any suggestions?

As I said, our pull replicator sends “heartbeat” messages every few minutes, so when one of those fails to be delivered the replicator should discover that the connection is broken. At that point it’ll start trying to reconnect, and eventually give up.

When you say “give up,” do you mean that when the heartbeat stops trying, it should then set the replication status to a failed state? That is never happening. From my observations, the heartbeat stops after it reaches the 512 second retry time, but it never changes the replication status.

Thanks for the detailed reply

jens · July 27, 2017, 8:06pm

You’ll be happy to know that we’ve decided to add the Connecting state in 2.0. (I’m definitely happy; I argued in favor of it.) It won’t be in 1.x, though; we’re not making any more changes of that magnitude, and it’s also harder to determine this state of the 1.x replicator due to its HTTP-based design.

Yes, it would be a great help (for debugging) to be able to recognize if replication is not working.

Logging is very helpful for this, if you’re running the app with Xcode attached.

For example, our application allows users to create documents based off of the contents of an existing document. We need to prevent users from doing this when the replication connection isn’t active

There’s not really a way to prevent this. No matter how hard you try, there can be race conditions where two clients do this at the same time. It’s a distributed system, so what works best is optimistic concurrency, i.e. assume conflicts can happen, but resolve them after the fact.

That being said, it sounds like what you want in this case is to wait for the (continuous) replicator’s state to become Idle. That means it’s caught up with current changes.

Alternatively, revealing the status/errors of the TCP socket connection could work. Is that something I can currently access?

No; it’s not something even we can currently access, because it’s buried down inside the implementation of Apple’s frameworks (NSURLSession, etc.). And it’s complicated by the fact that NSURLSession will open up to 8 sockets at once to the server, to parallelize access. So at any time some might be connected, others disconnected.
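Since the replicator’s socket state isn’t exposed, one possible app-level workaround is to probe the server separately with a short-timeout request and treat a transport error as "gateway unreachable". The following is a minimal sketch, not a Couchbase Lite feature: `ProbeResult`, `classify`, and `probe` are hypothetical names, the 5 second timeout is an arbitrary choice, and it assumes the gateway answers a plain GET at the probed URL. The error codes it would surface are the same NSURLError codes seen in the logs earlier in this thread (for example -1001 and -1004).

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // needed for URLSession on non-Apple platforms
#endif

/// Outcome of one probe against the server. Hypothetical type for illustration.
enum ProbeResult: Equatable {
    case reachable          // an HTTP response arrived (any status code)
    case unreachable(Int)   // transport-level failure; carries the error code
}

/// Pure classification step, kept separate so it can be checked without a network.
func classify(response: URLResponse?, error: Error?) -> ProbeResult {
    if let error = error {
        return .unreachable((error as NSError).code)
    }
    // No error but no HTTP response either: treat as a bad server response.
    return response is HTTPURLResponse ? .reachable : .unreachable(NSURLErrorBadServerResponse)
}

/// Sends one short-timeout GET to `url` and reports whether the server answered.
func probe(_ url: URL, timeout: TimeInterval = 5,
           completion: @escaping (ProbeResult) -> Void) {
    var request = URLRequest(url: url)
    request.timeoutInterval = timeout                    // assumed value; tune as needed
    request.cachePolicy = .reloadIgnoringLocalCacheData  // never answer from cache
    let task = URLSession.shared.dataTask(with: request) { _, response, error in
        completion(classify(response: response, error: error))
    }
    task.resume()
}

// Example call (placeholder host, as in the logs above):
// probe(URL(string: "http://gatewayIP:4984/")!) { result in print(result) }
```

A probe like this only says whether the server answered at that moment; it cannot prevent the race conditions described above, so it is better suited to driving a status indicator than to gating edits.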

From my observations, the heartbeat stops after it reaches the 512 second retry time, but it never changes the replication status.

What I said is true of one-shot replications, but not continuous ones. I should have been more specific; sorry. A continuous replication is supposed to keep trying no matter what, so you don’t have to babysit it. You’re right that the timeout will increase to some maximum, then stay there.
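The retry pattern described here (a delay that grows, then stays at a maximum) can be sketched as a capped exponential backoff. This is an illustration rather than the replicator’s actual code: `retryDelay` is a hypothetical function, and both the doubling rule and the 512 second cap are assumptions inferred from the log lines quoted earlier ("error #2, retrying in 4.0 sec", "error #9, retrying in 512.0 sec").

```swift
import Foundation

/// Delay in seconds before the retry that follows error number `attempt`.
/// Doubles with each attempt (2^attempt) and never exceeds `maxDelay`.
/// The default cap of 512 s is an assumption based on the longest delay in the logs.
func retryDelay(attempt: Int, maxDelay: TimeInterval = 512) -> TimeInterval {
    let uncapped = pow(2.0, Double(max(attempt, 0)))
    return min(uncapped, maxDelay)
}

for attempt in [2, 9, 12] {
    print("error #\(attempt): retry in \(retryDelay(attempt: attempt)) sec")
}
// error #2: retry in 4.0 sec
// error #9: retry in 512.0 sec
// error #12: retry in 512.0 sec
```

With a cap in place, a continuous replication keeps retrying at a steady interval once the maximum is reached instead of backing off indefinitely.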