Sync Gateway replication stuck with stateFnActiveFetchBulkGet got event: {FETCH_BULK_GET_FAILED unexpected EOF}

I’ve inherited a solution which uses Sync Gateway to replicate between two often-disconnected servers, and I have an issue where the sg_replicate process is dying with errors such as those below.

I’ve been working on this for a few days now. The remote server is often disconnected, and when it is online it’s firewalled off and slow, so remote upgrades are impossible. However, based on the “unexpected EOF” part, I have rebuilt the sg_replicate CLI binary using Go 1.9 and used that, but we are still stuck and replication is blocked.
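For what it’s worth, the rebuild itself was just the stock sg-replicate CLI built from source; roughly this, from memory, assuming GOPATH is set (the paths match the stack trace below):

    # fetch the sg-replicate sources (and dependencies), then build the CLI with Go 1.9
    go get -d github.com/couchbaselabs/sg-replicate/...
    cd $GOPATH/src/github.com/couchbaselabs/sg-replicate/cli
    go build -o sg-replicate-cli .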

I’ve also tried removing the attachments from documents (using the last ‘GET’ document ID in the Sync Gateway logs to pick them), which has in some cases worked (though I suspect the revisions that carried the attachments still need to replicate); however, this is a ridiculously painstaking process. I do have some conflicts which I’m working through, but I can’t believe the error below would be caused by a document conflict?
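For reference, the painstaking part looks roughly like this per suspect document; it’s just a sketch, the doc ID is one of those from the revs diff in the log, and it assumes jq is available:

    # local Sync Gateway and database, as per the _bulk_get URL in the log
    DB=http://localhost:44985/ensco122-rig
    DOC=03fdfb22-0d63-45db-b96a-53ffb1a85366

    # grab the current revision, then write the body back with the attachment
    # stubs stripped out, which creates a new rev without attachments
    REV=$(curl -s "$DB/$DOC" | jq -r '._rev')
    curl -s "$DB/$DOC" | jq 'del(._attachments)' \
      | curl -s -X PUT "$DB/$DOC?rev=$REV" -H "Content-Type: application/json" --data-binary @-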

Basically I’m looking for advice and pointers, as I’m about 3 days into this problem with no end in sight. All I’m really looking for (in my logical mind) is a means to identify the at-fault document, somehow zap it from the replication process and let the rest of the data through. Unfortunately sg_replicate seems to block and queue everything up behind it :frowning:
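For instance, I imagine replaying the failing _bulk_get from the log one document at a time would point at the culprit; something like the below, where the URL, doc ID and rev are copied straight from the log (just a sketch of the idea):

    # request a single doc/rev pair from the same _bulk_get endpoint the replicator
    # uses, and check whether the multipart response comes back intact
    curl -s -X POST "http://localhost:44985/ensco122-rig/_bulk_get?revs=true&attachments=true" \
         -H "Content-Type: application/json" \
         -d '{"docs":[{"id":"03fdfb22-0d63-45db-b96a-53ffb1a85366","rev":"19-45cc4fca84d0f01ddc17ce1aead78555"}]}' \
         -o /tmp/bulk_get_one_doc -w 'HTTP %{http_code}, %{size_download} bytes\n'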
TIA

2018/02/14 11:27:29 Replicate: [cloud-push] Transition from stateFnActiveFetchChangesFeed -> stateFnActiveFetchRevDiffs
2018/02/14 11:27:29 Replicate: [cloud-push] new state: 0x6de1c0
2018/02/14 11:27:29 Replicate: [cloud-push] stateFnActiveFetchRevDiffs
2018/02/14 11:27:31 Replicate: [cloud-push] revs diff resp: &{200 OK 200 HTTP/1.1 1 1 map[Content-Type:[text/plain; charset=utf-8] Server:[Couchbase Sync Gateway/1.3.1] Date:[Wed, 14 Feb 2018 11:31:36 GMT]] 0xc4203739c0 -1 [] false true map[] 0xc42013e800 0xc4200be630}, err: <nil>
2018/02/14 11:27:31 Replicate: [cloud-push] event: &{FETCH_REVS_DIFF_SUCCEEDED map[03fdfb22-0d63-45db-b96a-53ffb1a85366:{[19-45cc4fca84d0f01ddc17ce1aead78555]} fc42068b-66d3-46b8-958c-6fa3452da094:{[21-fadcd7cc7892c3b4000d530347bf52d0]} 0a8afbf7-f0d5-4690-9095-dd14820c2ee8:{[18-b946561579c733b1b0e3941dedb7f71a 14-1a6f5288e76213f5dd0a88da39dd198b]} 61c31170-ff26-432a-8d4d-894d1b127f11:{[18-6a3b44292dd80204c20ac4c173677ca9]} 8a3d6b57-b4a5-47a6-82cb-171061f9599a:{[20-23bc3205d6d96ef44dfd7a4cd120a93f]} 85532d6b-f062-4a77-ba3e-5428f8817144:{[2-2d6637c9d7964eff5e9914650729ced6]}]}
2018/02/14 11:27:31 Replicate: [cloud-push] stateFnActiveFetchRevDiffs got event: {FETCH_REVS_DIFF_SUCCEEDED map[8a3d6b57-b4a5-47a6-82cb-171061f9599a:{[20-23bc3205d6d96ef44dfd7a4cd120a93f]} 85532d6b-f062-4a77-ba3e-5428f8817144:{[2-2d6637c9d7964eff5e9914650729ced6]} 03fdfb22-0d63-45db-b96a-53ffb1a85366:{[19-45cc4fca84d0f01ddc17ce1aead78555]} fc42068b-66d3-46b8-958c-6fa3452da094:{[21-fadcd7cc7892c3b4000d530347bf52d0]} 0a8afbf7-f0d5-4690-9095-dd14820c2ee8:{[18-b946561579c733b1b0e3941dedb7f71a 14-1a6f5288e76213f5dd0a88da39dd198b]} 61c31170-ff26-432a-8d4d-894d1b127f11:{[18-6a3b44292dd80204c20ac4c173677ca9]}]}
2018/02/14 11:27:31 Replicate: [cloud-push] Transition from stateFnActiveFetchRevDiffs -> stateFnActiveFetchBulkGet
2018/02/14 11:27:31 Replicate: [cloud-push] new state: 0x6dea90
2018/02/14 11:27:31 Replicate: [cloud-push] stateFnActiveFetchBulkGet
2018/02/14 11:27:31 Replicate: [cloud-push] bulkGetUrl http://localhost:44985/ensco122-rig/_bulk_get?revs=true&attachments=true
2018/02/14 11:27:31 Replicate: [cloud-push] bulkGet req &{POST http://localhost:44985/ensco122-rig/_bulk_get?revs=true&attachments=true HTTP/1.1 1 1 map[] {0xc4201cc090} 0x6ca3a0 639 [] false localhost:44985 map[] map[] <nil> map[]   <nil> <nil> <nil> <nil>}
2018/02/14 11:27:31 Replicate: [cloud-push] bulk get resp: &{200 OK 200 HTTP/1.1 1 1 map[Content-Type:[multipart/mixed; boundary="b0705e23a5eb2814a1dd5e611b3a683ce02531dbb1b74219503086a704a5"] Server:[Couchbase Sync Gateway/1.3.1] Date:[Wed, 14 Feb 2018 11:27:31 GMT]] 0xc420384420 -1 [chunked] false false map[] 0xc42013eb00 <nil>}, err: <nil>
2018/02/14 11:27:31 Replicate: [cloud-push] mainPart: &{map[Content-Type:[application/json]] 0xc420092480  map[] {0xc4200a6460} 0 0 <nil> <nil>}.  Header: map[Content-Type:[application/json]]
2018/02/14 11:27:31 Replicate: [cloud-push] contentType: application/json
2018/02/14 11:27:31 Replicate: [cloud-push] boundary:
2018/02/14 11:27:31 Replicate: [cloud-push] Error decoding part: unexpected EOF
2018/02/14 11:27:31 Replicate: [cloud-push] stateFnActiveFetchBulkGet got event: {FETCH_BULK_GET_FAILED unexpected EOF}
2018/02/14 11:27:31 Replicate: [cloud-push] new state: <nil>
2018/02/14 11:27:31 Replicate: [cloud-push] processEvents() is done
2018/02/14 11:27:31 PANIC: Unable to run replication: {cloud-push http://localhost:44985 ensco122-rig [] https://e122-sgw:4zSmy3mW7Fv5uYRnziMr7vYvvkI66Y@sgw01.askaris.it:4984 ensco122 20 0 false false}. Err: Replication Aborted -- main.launchReplications() at cli.go:66
panic: Unable to run replication: {cloud-push http://localhost:44985 ensco122-rig [] https://e122-sgw:4zSmy3mW7Fv5uYRnziMr7vYvvkI66Y@sgw01.askaris.it:4984 ensco122 20 0 false false}. Err: Replication Aborted

goroutine 1 [running]:
github.com/couchbase/clog.Panicf(0x7b6efc, 0x26, 0xc4203845e0, 0x2, 0x2)
	/root/go/src/github.com/couchbase/clog/clog.go:263 +0x10c
main.launchReplications(0x14, 0x3e8, 0xc42010e0e0, 0x2, 0x2)
	/root/gocode/src/github.com/couchbaselabs/sg-replicate/cli/cli.go:66 +0x486
main.main()
	/root/gocode/src/github.com/couchbaselabs/sg-replicate/cli/cli.go:44 +0x4eb

Can you give any more information about what the data looks like (typical number of documents, doc size, max size)?

It sounds like, since there can be gaps and possibly poor connectivity, it’s just trying to load too much into a single response.

You can control the changes feed limit and the retry period. You might try setting the feed limit way down, so it only sends one document at a time, and see how that works. You can tune it back up once you see how it goes.
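Roughly, those are top-level settings in the sg-replicate config file; something like the sketch below, with credentials omitted and key names as I remember them from the sg-replicate README, so double-check against your existing file:

    {
      "changes_feed_limit": 1,
      "continuous_retry_time_ms": 15000,
      "replications": {
        "cloud-push": {
          "source_url": "http://localhost:44985",
          "source_db": "ensco122-rig",
          "target_url": "https://sgw01.askaris.it:4984",
          "target_db": "ensco122"
        }
      }
    }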

There are about 30,000 docs in total, I think, and they vary in size depending on attachments, up to about 2 MB. I really don’t think it’s a bandwidth/connection thing though, as when the remote server is connected the connection is solid but slow (we have a reverse SSL tunnel and that’s been up for over a week in this spell).

The positive news is that it’s fixed, but it’s difficult to be 100% sure of the resolution, as I poked and prodded quite a bit (some changes may have helped, some may not have). It certainly wasn’t conflicts, as some of those are still in place and it’s syncing quite happily now.

I am certain it’s an attachments issue, based on the EOF errors and the other log entries about boundaries etc. I believe the resolution came down to dropping the revs_limit setting down to 2, and then creating a new rev for any document in the last batch that had an attachment, removing the attachment in that rev. So basically, once there were no revs with attachments, it all went through.
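The revs_limit part was just the database-level setting in the local Sync Gateway config; trimmed down to the relevant key, it was something like this, sitting alongside the existing server/bucket settings for the database:

    {
      "databases": {
        "ensco122-rig": {
          "revs_limit": 2
        }
      }
    }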

There were probably 15 docs that I removed attachments from in total, though I’m not sure all of them were blockers. There are an awful lot more docs with new or existing attachments still in place, some with multiple attachments, so again I don’t think size and bandwidth are the issue.

Obviously this means we have data loss, which is not good, and I’m keen to avoid this happening again, so any pointers are gratefully received, as is any advice on better diagnosing the root cause and the steps to take to resolve it; whilst I got there in the end, it took far too long.

Thanks