Android 1.0.3 Sync gateway opens new persistent connection when wi-fi turned off and on


#1

I found some interesting sync gateway behaviour while trying to debug why the number of connections between our applications and our couchbase server continuously grows. We’re using couchbase-android-lite 1.0.3 (working towards updating but our dev is still trying to get our app to work with 64 bit) with sync gateway enterprise 1.0.3 behind nginx.

So long as the app is running (in foreground or background), if the user’s device loses network connectivity (even just switching from wi-fi to cellular or vice versa), then when it re-establishes connectivity the device sends a sync gateway longpoll request, which opens a persistent connection that never closes. Over time these connections build up and cause our server to use up all its open connections.

Setting a lower keep alive in nginx helps mitigate the problem by closing some of the extra connections, but there are others it isn’t able to close and they just build up. Even setting the TCP keep alive time on the server itself to just a minute doesn’t close them, I’m assuming because sync gateway answers the keep alive when asked. If I use the recommended sync gateway config settings (1000 requests over the same connection, keep alive of 360s) the connections just build up faster. With a total user base (not concurrent) of only ~4000 we built up 15000 established TCP connections to sync gateway in ~4 hours. I checked the lsof output and after some excel work found that only 1550 unique IP addresses were generating all those connections.

If I don’t use the recommended settings then nginx closes the connection, but sync gateway doesn’t acknowledge it correctly and I end up with a “can’t identify protocol” open file for the sync gateway process. These eventually close but tend to build up over time and can use up all of sync gateway’s open files.
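
The per-IP count described above can be done without a spreadsheet. The following is a minimal Python sketch, assuming lsof's usual one-line-per-descriptor output. The sample lines, process name, and addresses are made up for illustration (documentation-range IPs) and are not taken from the real server; the function name is hypothetical too.

```python
import re
from collections import Counter

# Matches the remote side of an ESTABLISHED TCP line, e.g.
# "TCP 198.51.100.1:443->203.0.113.7:51234 (ESTABLISHED)".
PEER = re.compile(r"TCP \S+->(\S+):\d+ \(ESTABLISHED\)")

def established_per_peer(lines):
    """Count ESTABLISHED connections per remote address."""
    counts = Counter()
    for line in lines:
        match = PEER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample, not real output from the thread.
SAMPLE = [
    "nginx 2001 www-data 11u IPv4 101 0t0 TCP 198.51.100.1:443->203.0.113.7:51234 (ESTABLISHED)",
    "nginx 2001 www-data 12u IPv4 102 0t0 TCP 198.51.100.1:443->203.0.113.7:51290 (ESTABLISHED)",
    "nginx 2001 www-data 13u IPv4 103 0t0 TCP 198.51.100.1:443->203.0.113.9:40022 (ESTABLISHED)",
    "nginx 2001 www-data 14u IPv4 104 0t0 TCP 198.51.100.1:443->203.0.113.9:40100 (CLOSE_WAIT)",
]

per_peer = established_per_peer(SAMPLE)
print(len(per_peer), "unique peers,", sum(per_peer.values()), "established connections")

# Figures quoted in the post: 15000 connections from 1550 unique addresses.
avg_per_address = 15000 / 1550
print(round(avg_per_address, 1), "connections per address on average")
```

With the numbers from the post, that is close to ten established connections per address, which fits the picture of each reconnect leaving an old connection behind.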

I’m hoping moving to the next version of couchbase-android-lite will fix this issue (I heard it allows for the use of web sockets instead of longpoll). But in case it doesn’t or a lot of users take a long time to update their apps, is there anything server side I can do in sync-gateway so that it closes connections properly instead of just building up an unlimited number of “can’t identify protocol” files?


#2

What may be happening:

  • Clients drop off the network (without closing their end of the socket)
  • By default TCP will wait a long time (2 hours by default on Linux, and only if keepalive is enabled on the socket) before sending a keepalive probe over an idle socket to see if it’s still alive.
  • These are piling up …
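
The idle-socket timing in the list above can be sketched in Python. This is only an illustration of how TCP keepalive is tuned per socket on Linux. Whether sync gateway sets these options is not stated in this thread, and the helper names and the 60/10/3 values are hypothetical. The 7200/75/9 figures are the stock Linux sysctl defaults.

```python
import socket

def keepalive_detection_seconds(idle, interval, probes):
    """Rough time before the kernel gives up on a silent peer:
    the idle wait, then `probes` unanswered probes spaced `interval` apart."""
    return idle + interval * probes

def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Turn on keepalive for one socket and, where the platform
    exposes them, override the per-socket timers (Linux names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return keepalive_detection_seconds(idle, interval, probes)

# Stock Linux defaults: tcp_keepalive_time=7200, _intvl=75, _probes=9.
linux_default = keepalive_detection_seconds(7200, 75, 9)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tuned = enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()

print("default:", linux_default, "s; tuned:", tuned, "s")
```

With the defaults a vanished client can hold its socket for a little over two hours, which is why an application-level heartbeat is the faster way to notice a dead peer.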

What is supposed to be happening:

  • The server should be trying to send a heartbeat every 300 seconds, which is specified by the client as a URL parameter on the request to the _changes endpoint.
  • When trying to send a heartbeat on an idle socket, it should realize within approximately 15 seconds that there is nobody on the other end and close the connection.

Isolating the problem:

  • Is it possible to reproduce the issue with nginx out of the picture and clients connecting directly to sync gateway? This could help isolate the problem.
  • Can you post your nginx configuration or at least a snippet of it?
  • Can you post your sync gw config or at least a snippet which has the keep alive changes you mentioned?
  • Can you post the relevant lsof output to a github gist (or similar) and post a link to it?
  • When you say “cause our server to use up all its open connections”, can you post the errors? (if it’s over a page, please post to a github gist)

Also, couchbase lite android does not yet have websocket support, but it’s definitely on the roadmap.


#3

Thanks for the help, this issue has been causing me a lot of trouble for the past few weeks since I’m brand new to server administration and learning this all as I go. It will take a few hours for me to test hitting sync gateway directly, but I can answer the other questions:

nginx.conf relevant lines:

user www-data;
worker_processes 4;
worker_rlimit_nofile 8192;
pid /run/nginx.pid;

events {
    worker_connections 8192;
    # multi_accept on;
}

http {

    ##
    # Basic Settings
    ##

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    # server_tokens off;

    # server_names_hash_bucket_size 64;
    # server_name_in_redirect off;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ##
    # Logging Settings
    ##

    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    ##
    # Gzip Settings
    ##

    gzip on;
    gzip_disable "msie6";

    # gzip_vary on;
    # gzip_proxied any;
    # gzip_comp_level 6;
    # gzip_buffers 16 8k;
    # gzip_http_version 1.1;
    # gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;

    ##
    # nginx-naxsi config
    ##
    # Uncomment it if you installed nginx-naxsi
    ##

    #include /etc/nginx/naxsi_core.rules;

    ##
    # nginx-passenger config
    ##
    # Uncomment it if you installed nginx-passenger
    ##
    
    #passenger_root /usr;
    #passenger_ruby /usr/bin/ruby;

    ##
    # Virtual Host Configs
    ##

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}

Relevant sections of the nginx config for the application in /etc/nginx/sites-enabled (/dbws/ is the web sockets endpoint for iOS, and /db/ is the endpoint for android and web to access sync gateway):

server {
    listen 443;
    #ssl setup stuff
    server_name <host name>;
    client_max_body_size 20M;
   
    # Make site accessible from http://localhost/
    server_name localhost;
   
    location /dbws/ {
        proxy_pass_header Accept;
        proxy_pass_header Server;
        keepalive_requests 1000;
        keepalive_timeout 360s;
        proxy_read_timeout 360s;
        proxy_pass http://localhost:4984/<bucket name>/;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_http_version 1.1;
    }

    location /db/ {
        proxy_pass http://localhost:4984/<bucket name>/;
    }
}

The keep alive settings were in nginx. I couldn’t find any documentation listing all the possible sync gateway config params, but these are the ones we have:

{
    "log": ["CRUD", "REST+", "Shadow", "Access"],
    "facebook": {
        "register": true
    },
    "CORS": {
        "Origin":["*"],
        "Headers": ["DNT","X-Mx-ReqToken","Keep-Alive","User-Agent","X-Requested-With","If-Modified-Since","Cache-Control","Content-Type"],
        "MaxAge": 1728000
    },
    "MaxFileDescriptors":20000,
    "databases": {
        "todos": {
            "server": "http://<hostname>:8091",
            "users": {
                "GUEST": {
                    "disabled": false
                }
            },

            "sync": `
// @formatter:off
// sync function

The sync gateway documentation for setting up with an nginx reverse proxy (http://developer.couchbase.com/mobile/develop/guides/sync-gateway/nginx/configuring-nginx-for-sync-gateway/index.html) says to use all the parameters we have specified in the /dbws/ endpoint. But since we have android, web, and iOS using it, the most I can specify in our generic sync gateway endpoint are keepalive_requests, keepalive_timeout, and proxy_read_timeout, so that at least the connections will last longer and not keep getting torn down before the longpoll is done. When I do this we don’t get any CLOSE_WAIT or “can’t identify protocol” connections for sync gateway in the lsof output, but the number of connections just keeps growing (the situation mentioned in my first post).

If I just proxy pass to sync gateway it works ok, though we get a ton of “can’t identify protocol” and CLOSE_WAIT connections in the lsof output. I’ve increased the number of open files for sync gateway now, and it seems to be able to remove those dead connections fast enough that we don’t use up all its open files, but we haven’t had any significant load yet to truly test it out. This is a snippet of the lsof output showing the “can’t identify protocol” and CLOSE_WAIT connections. I’m pretty sure they appear because nginx is closing them on sync gateway in a way it doesn’t know how to handle:

sync_gate 14154     root 4239u  sock                0,7       0t0 49412253 can't identify protocol
sync_gate 14154     root 3303u  IPv6           49422016       0t0      TCP localhost:4984->localhost:58581 (CLOSE_WAIT)
sync_gate 14154     root 3145u  IPv6           49426317       0t0      TCP localhost:4984->localhost:59705 (ESTABLISHED)

At the point I ran lsof to get these snippets sync gateway had 2759 “can’t identify protocol” lines, 1046 CLOSE_WAIT, and 279 ESTABLISHED. nginx had only ESTABLISHED connections. I don’t want to expose sync gateway directly to the internet though because nginx is designed to handle many connections very efficiently, we already have ssl setup on it, and this way it hides our actual bucket name and port from clients.
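
Totals like the ones quoted above can be produced by classifying each lsof line. The following is a minimal Python sketch that uses the three lines from the snippet as input. The function names are hypothetical, and the "other" bucket is an assumption for lines that match neither pattern.

```python
import re
from collections import Counter

# A trailing "(STATE)" such as "(CLOSE_WAIT)" or "(ESTABLISHED)".
STATE = re.compile(r"\((\w+)\)\s*$")

def classify(line):
    """Return the TCP state of an lsof line, or a label for
    sockets lsof could not attribute to a protocol."""
    if "can't identify protocol" in line:
        return "can't identify protocol"
    match = STATE.search(line)
    return match.group(1) if match else "other"

def tally(lines):
    return Counter(classify(line) for line in lines)

# The three lines from the snippet above.
SNIPPET = [
    "sync_gate 14154     root 4239u  sock                0,7       0t0 49412253 can't identify protocol",
    "sync_gate 14154     root 3303u  IPv6           49422016       0t0      TCP localhost:4984->localhost:58581 (CLOSE_WAIT)",
    "sync_gate 14154     root 3145u  IPv6           49426317       0t0      TCP localhost:4984->localhost:59705 (ESTABLISHED)",
]

counts = tally(SNIPPET)
for state, n in counts.items():
    print(n, state)
```

Run against the full lsof output at intervals, a tally like this shows whether the dead sockets are being reclaimed faster than they are created.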

By “use up all our connections” I meant open files. It generally hits that before actually using up all available ports. This is the error I see in the sync gateway log when it happens:

2015/03/10 18:59:04 http: Accept error: accept tcp [::]:4984: too many open files; retrying in 320ms

I’ve increased the number of open files for sync gateway to 20000, so it can handle the load now when I specify a keep_alive in nginx of 65 seconds, but if I change sync_gateway’s keep_alive to 360s it will just keep building up connections. Our android clients longpoll the server every 3 minutes so long as the app is either running or minimized on the client device. That is due to our choice of using continuous replication though and I need to talk to our developer about that because it isn’t needed for our application.

Thanks again for any help or info you can give on this issue.


#4

I checked using a build that connected directly to sync gateway. The only change was to the url used to contact sync gateway: we changed it from http://<hostname>/db/ to http://<hostname>:4984/<bucket name>/. But now it doesn’t do longpoll requests, it just calls for individual files. Before, when the url was http://<hostname>/db/, I would see the following line appear in the sync-gateway.log file about every 3 minutes:

22:29:53.997068 HTTP:  #141: GET /todos/_changes?feed=longpoll&limit=50&heartbeat=300000&style=all_docs&since=56675  (as <user id of user logged into device>)

But now with the direct url to sync gateway it just calls:

18:53:46.723357 HTTP:  #292: POST /todos/_changes  (as <user id of user logged in to device>)

And then retrieves the individual documents that have been changed with GET calls. Nothing has changed in the app or on the server, aside from the url it connects with. After changing from cellular network to wi-fi 3 times I now have 4 connections with the server, straight to sync gateway, even though the app is minimized and, after the first two, I force stopped the app and removed all its data to get it to sync properly. Those connections may disappear a few hours from now, though, since the ones on the server I was testing with yesterday have gone away. But having multiple connections open with clients for a few hours still isn’t great performance-wise.
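
For reference, the parameters of the longpoll request logged above can be pulled apart with a short Python sketch. The path is copied from the log line, and the helper name is hypothetical. The heartbeat value is in milliseconds, so 300000 corresponds to the 300 second heartbeat mentioned earlier in the thread.

```python
from urllib.parse import urlsplit, parse_qs

def changes_params(path):
    """Return the query parameters of a _changes request as a flat dict."""
    query = parse_qs(urlsplit(path).query)
    return {key: values[0] for key, values in query.items()}

# Request path copied from the sync gateway log line.
LOGGED = "/todos/_changes?feed=longpoll&limit=50&heartbeat=300000&style=all_docs&since=56675"

params = changes_params(LOGGED)
heartbeat_seconds = int(params["heartbeat"]) / 1000

print(params["feed"], "feed, heartbeat every", heartbeat_seconds, "seconds")
```

The POST variant logged for the direct connection carries no query string in the log line, so nothing can be read from it the same way.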


#5

Regarding the errors about running out of sockets / file descriptors:

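Can you try increasing the max number of file descriptors? On Linux, this is done by running the following in the shell that launches sync gateway (ulimit is a shell builtin, so it cannot be invoked through sudo; raising the limit beyond the current hard limit requires a root shell):

ulimit -n 65536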

This should raise the high water mark and hopefully prevent the error from happening. By default, after 7200 seconds (2 hours) any clients who have disconnected abruptly without closing their connections will have their connections removed and will no longer count against the max total number of sockets for the process.
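
To see what limit a process actually runs with, the values can be read from inside the process. The following is a minimal Python sketch using the standard resource module (Unix only). It only affects the Python process that runs it and is meant to illustrate the soft/hard distinction; sync gateway is configured separately (for example through the MaxFileDescriptors setting shown earlier). The function names are hypothetical.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open-file limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def raise_soft_limit(target):
    """Raise the soft limit towards `target`, never above the hard
    limit and never lowering it. Returns the resulting soft limit."""
    soft, hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # the platform refused; keep the current limit
    return nofile_limits()[0]

before_soft, hard_limit = nofile_limits()
after_soft = raise_soft_limit(65536)
print("soft limit:", before_soft, "->", after_soft, "(hard limit:", hard_limit, ")")
```

An unprivileged process can raise its soft limit only up to the hard limit, which is why a value set in one shell does not carry over to a service started elsewhere.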

Also, I’m investigating the issue further and putting notes here: https://github.com/couchbase/sync_gateway/issues/742


#6

Also, can you post log snippets of the errors you are seeing on sync gateway? If posting more than 10 lines, please post it in a github gist or similar service.


#7

I increased the number of open files for sync gateway (using the MaxFileDescriptors config param in my sync gateway config file) and that solved the issue of running out of file descriptors. When I was running out of them I would get this error in the sync gateway log:

2015/03/10 18:59:04 http: Accept error: accept tcp [::]:4984: too many open files; retrying in 320ms

Now I see a lot of broken pipe errors, probably due to connections being closed by nginx:

06:31:06.286093 HTTP: #218745:     --> 599 Write error: write tcp 127.0.0.1:42432: broken pipe  (0.0 ms)

#8

Good to hear!

I forgot to mention that you should persist your changes to the max number of file descriptors, otherwise this will revert back to the default after a reboot.

The full instructions are here.

Regarding nginx, I would compare your configuration against the recommendations in the guide.


#9

I can’t use the recommendations in the guide because the number of connections keeps growing, even after it has reached more than double our total user base. If I use those settings it will just keep growing until it completely uses up all the sync gateway connections, even if I set the limit to something really high like 20000. I fixed it by not following the recommendations in the guide and setting a low timeout value so that the connections are recycled fast enough that they stop growing forever. Our iOS clients will be using web sockets soon, though, so I will be able to test that with the recommended settings and see if the number of connections is stable.