Max-parallelism (and all other admin settings) going back to defaults all the time

server
query
n1ql

#1

Couchbase Server Enterprise Edition 5.5.1 build 3511.

We are testing on a multi server, each with 8 cores. There are servers dedicated to Query service.

In the past, we noticed that increasing max-parallelism improves our performance dramatically.

However, we learned that each service restart, the number goes back to “1” (and so do other settings like pipeline-batch etc).

In addition, recently (we can’t tell if it was a version change or us increasing the load) - we noticed that under load these values reset themselves without a service restart. The moment we increase the load of the clients working with the data - one after the other, all query servers will reset themselves to defaults. Usually within seconds to 1-2 minutes.

Two questions:

  1. How do we change the admin/settings once and for all?
  2. What can cause those “resets” (it leads me to believe the service did restart due to load, which is probably bad sign too…)

Thanks


#2

Query Service Settings are not persistent MB-21711. You can set per query level. https://docs.couchbase.com/server/6.0/settings/query-settings.html


#3

Yes - that I figured, but why is that? this is a huge oversight. Why not allow to set it in some config? The cbq-engine command even has -max-parallelism flag - so it was designed to be started with a different number.

Also - why is max-parallelism default = “1” on a multi-core server that is dedicated to query service? defaulting to “0” would make more sense to me.

And lastly - what about the issue of the server resets itself without a restart?


#4

max-parallelism - 0 or 1 debatable. More than 1 mean how many operators (duplicate) run in parallel when possible. This requires more memory/CPU. This can impact scalability. That is why it kept at 1 and configurable.

Not persisting settings is in convenient and there is open improvement. cc @keshav_m


#5

Thanks @vsr1 - I’m not sure if you’re a CB employee or just a helpful guy here in the forum. In any case, it seems your answer is about the theoretical side of things - I understand what the value does.
[As a side note, we are running the query services on dedicated servers (no other CB service or anything else is running there). There are multiple of them, and these are 32GB RAM 8 cores machines]
As our testing showed many times, increasing max-parallelism improves our performance dramatically (also mentioned in the first post). So far for the theory and need.

Not making it sticky is more than an inconvenience, as there are no ways to intervene during cbq-engine restarts (to our knowledge). So in order not to “forget” to set it to “0” - one will have to write some kind of watchdog to see when the service restarted or the value has changed and if so - re-send requests to change admin values (repeat per query server of course).

And - once again - the other, more worrisome question still stands - why do these numbers reset back to default in our environment within seconds of increasing client load. I hope someone “official” from CB can help pinpoint that?


#6

Check query.log why value is reset.

  1. “Query Configuration changed for max-parallelism.” User explicitly set
  2. “cbq-engine started” Query service is restarted some reason.

#7

I see multiple “cbq-engine started” in logs.
here is grep from one server. Can we see something from this?

# grep -n "cbq-engine started" /opt/couchbase/var/lib/couchbase/logs/query.log
4028:_time=2018-12-14T15:18:22.698-05:00 _level=INFO _msg=cbq-engine started version=2.0.0-N1QL datastore=http://127.0.0.1:8091 request-cap=256 timeout=0s servicers=32 n1ql_feat_cntrl=0 loglevel=INFO pipeline-cap=512 pipeline-batch=16 max-concurrency=8 plus-servicers=128 scan-cap=512 request-size-cap=67108864 max-index-api=3
12132:_time=2018-12-14T15:19:26.305-05:00 _level=INFO _msg=cbq-engine started loglevel=INFO pipeline-cap=512 n1ql_feat_cntrl=0 version=2.0.0-N1QL datastore=http://127.0.0.1:8091 pipeline-batch=16 request-cap=256 servicers=32 scan-cap=512 timeout=0s request-size-cap=67108864 max-index-api=3 max-concurrency=8 plus-servicers=128
22144:_time=2018-12-14T15:20:23.089-05:00 _level=INFO _msg=cbq-engine started plus-servicers=128 scan-cap=512 request-size-cap=67108864 datastore=http://127.0.0.1:8091 max-concurrency=8 loglevel=INFO servicers=32 pipeline-cap=512 timeout=0s max-index-api=3 version=2.0.0-N1QL pipeline-batch=16 request-cap=256 n1ql_feat_cntrl=0
30682:_time=2018-12-14T15:20:58.148-05:00 _level=INFO _msg=cbq-engine started max-concurrency=8 loglevel=INFO servicers=32 scan-cap=512 datastore=http://127.0.0.1:8091 request-cap=256 request-size-cap=67108864 n1ql_feat_cntrl=0 version=2.0.0-N1QL plus-servicers=128 max-index-api=3 timeout=0s pipeline-cap=512 pipeline-batch=16
39610:_time=2018-12-14T15:22:15.371-05:00 _level=INFO _msg=cbq-engine started pipeline-cap=512 loglevel=INFO plus-servicers=128 scan-cap=512 datastore=http://127.0.0.1:8091 servicers=32 max-concurrency=8 request-size-cap=67108864 max-index-api=3 n1ql_feat_cntrl=0 timeout=0s version=2.0.0-N1QL pipeline-batch=16 request-cap=256
47877:_time=2018-12-14T15:23:51.766-05:00 _level=INFO _msg=cbq-engine started version=2.0.0-N1QL loglevel=INFO plus-servicers=128 max-index-api=3 max-concurrency=8 pipeline-cap=512 request-size-cap=67108864 datastore=http://127.0.0.1:8091 servicers=32 scan-cap=512 timeout=0s pipeline-batch=16 request-cap=256 n1ql_feat_cntrl=0
57148:_time=2018-12-14T15:25:54.193-05:00 _level=INFO _msg=cbq-engine started pipeline-batch=16 max-index-api=3 max-concurrency=8 loglevel=INFO plus-servicers=128 timeout=0s datastore=http://127.0.0.1:8091 request-cap=256 version=2.0.0-N1QL servicers=32 scan-cap=512 pipeline-cap=512 request-size-cap=67108864 n1ql_feat_cntrl=0
65237:_time=2018-12-14T15:27:20.962-05:00 _level=INFO _msg=cbq-engine started scan-cap=512 loglevel=INFO pipeline-batch=16 request-cap=256 max-index-api=3 n1ql_feat_cntrl=0 datastore=http://127.0.0.1:8091 servicers=32 pipeline-cap=512 request-size-cap=67108864 version=2.0.0-N1QL plus-servicers=128 timeout=0s max-concurrency=8

#8

Check query.log around that time why it is restarted. Did you see any errors. Is System ran out of memory and cbq-engine killed by OS? (you can check system messages)


#9

System (OS) has lots of free mem:

# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        1.8G         21G        305M        8.6G         28G
Swap:           31G        158M         31G

The one thing that I do see in the query.log file that sort-of coincides with the cbq-started are these lines:
(for example, the last one happened 3 sec. before the last restart):

fatal error: concurrent map read and map write

What do they mean?


#10

hum. Did you see any query info around that message or panic.
Is this from query.log. What type of query you are running.

You might be hitting MB-31049. You can try with 5.5.2


#11

Ok, so after this short delay - we upgraded the entire cluster to 5.5.3 (by the way, i wish there was a more graceful way to do that on a multi-node production system - like a gradual rollover - but that’s for a different post).

Problem persists - I change it to "max-parallelism":8 and it goes back to 1 within seconds to minutes on all query servers.

re “seeing query info” - i believe the actual queries are not logged in the query.log (wrong?) . This happens when we are testing scale and multi clients working. We have a bunch of programs running updates (probably both N1QL and API-based updates and mutations). Still - it’s a database - it should be able to handle concurrent requests to update data without crashing all the time.

Any other idea where to look?


#12

https://docs.couchbase.com/server/6.0/install/upgrade.html
https://blog.couchbase.com/couchbase-rolling-upgrades/

As previously mentioned query settings are not persistent and there is open enhancement.

query info is logged in query.log only when query got panic.

If setting resetting means your system ran out of memory and OS might have killed process. You need to create right indexes or lower work load to reduce memory.
There is no transactions, but concurrent updates are supported as long they don’t update same row.


#13

@vsr1

I don’t see a specific query info in the log.
Somf of the stuff I see between two cbq-engine started are these (mind you, the last 2 I’m looking at are 3 minutes apart):

  • bunch of Starting audit worker ..
  • healthy nodes 6 failed node 0 unhealthy node 0 add node 0
  • bunch of successfully reach indexer at ...
  • Some Using plain authentication for user <ud>@cbq-engine</ud>
  • Some [Queryport-connpool:XXX:9101] open new connection ...
  • switched currmeta from 77 -> 77 force true (few times, not consecutive)
  • Bunch like Rollback time has changed for index inst 7984743597241962203. New rollback time 1544855403581401799
  • Then, it starts with messages about failing to do insert, a-la: _time=2018-12-19T09:41:15.416-05:00 _level=ERROR _msg=Failed to perform <ud>insert</ud> on key <ud>XXXX</ud> for Keyspace YYYY. (and over the next few minutes more and more of these)
  • again switched currmeta from 77 -> 77 force true and then few [Queryport-connpool:XXX:9101] open new connection ... followed by several repeats of :
  • serviceChangeNotifier: received PoolChangeNotification
  • Refreshing indexer list due to cluster changes or auto-refresh.
  • and finally, the mentioned fatal error: concurrent map iteration and map write followed by a million (ok, over 400) stack-trace looking dumps and then the restart.

( I don’t mind sharing the whole log, or snippet of it, but prefer not to post it here with real document IDs and server names - if you don’t mind, I can send it directly to whoever is willing to look into it )

Like said, the whole thing in this case was 3 minutes (and repeats every 2-3 minutes)

Re “resources” like I mentioned, each server has 32GB of RAM, and according to the stats page, it never drops (those servers are running only the query service - data and index are on separate servers). See image below.

We do have what we believe are the right indexes (but what does this mean? we can debate “right” indexes forever). Also, we use combination of N1QL and API direct changes (insert/upsert/mutate_in). The API ones will not really use GSI indexes (maybe the primary one). All indexes we have are created with replicas, and we are running separate index servers from the query (and data) ones. Each has at least 32GB RAM installed.

Re concurrency - I understand that 2 updates can not occur on same “row”. I expect one of them to fail (we are using CAS when it is relevent to fail it and redo the update if needed). However, I do not expect the server to practically crash every time two clients are trying to update the same “row”, so I hope this is not what you are suggesting.

Image from the same query server where the log was taken, over same period:
free_ram


#14

When u see fatal error: concurrent map iteration and map write
there might be stack trace. Could you please post that.


#15

Here’s the first few traces (but there are about 400 of them):

fatal error: concurrent map iteration and map write

goroutine 69320 [running]:
runtime.throw(0xfc112b, 0x26)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/panic.go:596 +0x95 fp=0xc4245f53d8 sp=0xc4245f53b8
runtime.mapiternext(0xc4245f54a8)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/hashmap.go:737 +0x7ee fp=0xc4245f5488 sp=0xc4245f53d8
github.com/couchbase/query/value.sortedNames(0xc422fa4030, 0xc423283b00, 0x2, 0x10, 0x0, 0x0, 0xf32fe0)
	goproj/src/github.com/couchbase/query/value/object.go:620 +0xd6 fp=0xc4245f5518 sp=0xc4245f5488
github.com/couchbase/query/value.objectValue.MarshalJSON(0xc422fa4030, 0x0, 0x0, 0x0, 0x0, 0x0)
	goproj/src/github.com/couchbase/query/value/object.go:48 +0x174 fp=0xc4245f55d8 sp=0xc4245f5518
encoding/json.marshalerEncoder(0xc42235f290, 0xf32fe0, 0xc422fa4030, 0x15, 0xc422fa0100)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:451 +0x9f fp=0xc4245f5648 sp=0xc4245f55d8
encoding/json.(*encodeState).reflectValue(0xc42235f290, 0xf32fe0, 0xc422fa4030, 0x15, 0xc422fa0100)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:323 +0x82 fp=0xc4245f5680 sp=0xc4245f5648
encoding/json.interfaceEncoder(0xc42235f290, 0xe1f200, 0xc4258655e0, 0x94, 0xc425860100)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:617 +0xdb fp=0xc4245f56c0 sp=0xc4245f5680
encoding/json.(*mapEncoder).encode(0xc42000e148, 0xc42235f290, 0xe32bc0, 0xc424d77230, 0x15, 0xe30100)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:690 +0x589 fp=0xc4245f5830 sp=0xc4245f56c0
encoding/json.(*mapEncoder).(encoding/json.encode)-fm(0xc42235f290, 0xe32bc0, 0xc424d77230, 0x15, 0xc424d70100)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:706 +0x64 fp=0xc4245f5870 sp=0xc4245f5830
encoding/json.(*encodeState).reflectValue(0xc42235f290, 0xe32bc0, 0xc424d77230, 0x15, 0x100)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:323 +0x82 fp=0xc4245f58a8 sp=0xc4245f5870
encoding/json.(*encodeState).marshal(0xc42235f290, 0xe32bc0, 0xc424d77230, 0xc420f50100, 0x0, 0x0)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:296 +0xb8 fp=0xc4245f58e0 sp=0xc4245f58a8
encoding/json.Marshal(0xe32bc0, 0xc424d77230, 0x3, 0xc420024650, 0xc420024600, 0xc420f54b60, 0xc422383380)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/encoding/json/encode.go:161 +0x6e fp=0xc4245f5928 sp=0xc4245f58e0
github.com/couchbase/go-couchbase.(*Bucket).WriteCasWithMT(0xc420496e00, 0xc420945080, 0x23, 0x2000000, 0x0, 0x1571c2edd2810000, 0xe32bc0, 0xc424d77230, 0x0, 0x0, ...)
	godeps/src/github.com/couchbase/go-couchbase/client.go:869 +0x3b7 fp=0xc4245f5a00 sp=0xc4245f5928
github.com/couchbase/go-couchbase.(*Bucket).CasWithMeta(0xc420496e00, 0xc420945080, 0x23, 0x2000000, 0x0, 0x1571c2edd2810000, 0xe32bc0, 0xc424d77230, 0x1571c30646400000, 0x0, ...)
	godeps/src/github.com/couchbase/go-couchbase/client.go:851 +0x93 fp=0xc4245f5a78 sp=0xc4245f5a00
github.com/couchbase/query/datastore/couchbase.(*keyspace).performOp(0xc4208dd950, 0x2, 0xc4223f7000, 0x10, 0x40, 0x344659980000000a, 0xed3ac4ff2, 0x1, 0xb5b3, 0x1908600)
	goproj/src/github.com/couchbase/query/datastore/couchbase/couchbase.go:1263 +0xaa9 fp=0xc4245f5c28 sp=0xc4245f5a78
github.com/couchbase/query/datastore/couchbase.(*keyspace).Update(0xc4208dd950, 0xc4223f7000, 0x10, 0x40, 0xc421344390, 0x0, 0x0, 0xc4245f5d08, 0x472640)
	goproj/src/github.com/couchbase/query/datastore/couchbase/couchbase.go:1295 +0x56 fp=0xc4245f5c88 sp=0xc4245f5c28
github.com/couchbase/query/execution.(*SendUpdate).flushBatch(0xc4244505a0, 0xc4265ab600, 0xed3ac4f00)
	goproj/src/github.com/couchbase/query/execution/update_send.go:154 +0x75d fp=0xc4245f5d80 sp=0xc4245f5c88
github.com/couchbase/query/execution.(*base).enbatchSize(0xc4244505a0, 0x18d31a0, 0xc4208f4810, 0x189c720, 0xc4244505a0, 0x10, 0xc4265ab600, 0x18d31a0)
	goproj/src/github.com/couchbase/query/execution/base.go:707 +0x1a2 fp=0xc4245f5de0 sp=0xc4245f5d80
github.com/couchbase/query/execution.(*base).enbatch(0xc4244505a0, 0x18d31a0, 0xc4208f4810, 0x189c720, 0xc4244505a0, 0xc4265ab600, 0xaef701)
	goproj/src/github.com/couchbase/query/execution/base.go:721 +0x69 fp=0xc4245f5e30 sp=0xc4245f5de0
github.com/couchbase/query/execution.(*SendUpdate).processItem(0xc4244505a0, 0x18d31a0, 0xc4208f4810, 0xc4265ab600, 0x1)
	goproj/src/github.com/couchbase/query/execution/update_send.go:56 +0x90 fp=0xc4245f5e78 sp=0xc4245f5e30
github.com/couchbase/query/execution.(*base).runConsumer.func1()
	goproj/src/github.com/couchbase/query/execution/base.go:547 +0x25d fp=0xc4245f5f00 sp=0xc4245f5e78
github.com/couchbase/query/util.(*Once).Do(0xc424450698, 0xc4209d8738)
	goproj/src/github.com/couchbase/query/util/sync.go:51 +0x68 fp=0xc4245f5f28 sp=0xc4245f5f00
github.com/couchbase/query/execution.(*base).runConsumer(0xc4244505a0, 0x189a160, 0xc4244505a0, 0xc4265ab600, 0x0, 0x0)
	goproj/src/github.com/couchbase/query/execution/base.go:552 +0xaf fp=0xc4245f5f80 sp=0xc4245f5f28
github.com/couchbase/query/execution.(*SendUpdate).RunOnce(0xc4244505a0, 0xc4265ab600, 0x0, 0x0)
	goproj/src/github.com/couchbase/query/execution/update_send.go:52 +0x5c fp=0xc4245f5fc0 sp=0xc4245f5f80
runtime.goexit()
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4245f5fc8 sp=0xc4245f5fc0
created by github.com/couchbase/query/execution.(*Sequence).RunOnce.func1
	goproj/src/github.com/couchbase/query/execution/sequence.go:95 +0x404

goroutine 1 [chan receive, 2 minutes]:
runtime.gopark(0xfdf1a0, 0xc420aca298, 0xfa7e83, 0xc, 0x17, 0x3)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:271 +0x13a fp=0xc420d3faa0 sp=0xc420d3fa70
runtime.goparkunlock(0xc420aca298, 0xfa7e83, 0xc, 0xc42048fb17, 0x3)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:277 +0x5e fp=0xc420d3fae0 sp=0xc420d3faa0
runtime.chanrecv(0xde2000, 0xc420aca240, 0xc42048fc38, 0x1, 0xc42048fbe0)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/chan.go:513 +0x371 fp=0xc420d3fb80 sp=0xc420d3fae0
runtime.chanrecv1(0xde2000, 0xc420aca240, 0xc42048fc38)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/chan.go:395 +0x35 fp=0xc420d3fbb8 sp=0xc420d3fb80
main.signalCatcher(0xc42016d900, 0xc420021080)
	goproj/src/github.com/couchbase/query/server/cbq-engine/main.go:298 +0x135 fp=0xc420d3fc78 sp=0xc420d3fbb8
main.main()
	goproj/src/github.com/couchbase/query/server/cbq-engine/main.go:288 +0x145d fp=0xc420d3ff88 sp=0xc420d3fc78
runtime.main()
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:185 +0x20a fp=0xc420d3ffe0 sp=0xc420d3ff88
runtime.goexit()
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420d3ffe8 sp=0xc420d3ffe0

goroutine 17 [syscall, 2 minutes, locked to thread]:
runtime.goexit()
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420044fe8 sp=0xc420044fe0

goroutine 2 [force gc (idle), 2 minutes]:
runtime.gopark(0xfdf1a0, 0x1907e70, 0xfaaad0, 0xf, 0xfdf014, 0x1)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:271 +0x13a fp=0xc420032768 sp=0xc420032738
runtime.goparkunlock(0x1907e70, 0xfaaad0, 0xf, 0xc420000114, 0x1)
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:277 +0x5e fp=0xc4200327a8 sp=0xc420032768
runtime.forcegchelper()
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:226 +0x9e fp=0xc4200327e0 sp=0xc4200327a8
runtime.goexit()
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200327e8 sp=0xc4200327e0
created by runtime.init.4
	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:215 +0x35

#16

Thank you. Stack trace will help. Will investigate and will update findings. Is it possible share the UPDATE statement .
Opened MB-32412.

If you are using max_parallelism set that to 1 for UPDATE statement and see if the panic goes away.


#17

Thanks for opening the ticket.

It’s hard to pin-point the exact query that made this - as it’s running multiple client-instances on a remote server (Also assuming the clocks are synced to the milli between the servers…).

However, going by the clock times, etc - it is very likely that the query that proceeded this is something along these lines (thousands of these get fired all the time):

UPDATE
	bucket1 USE KEYS 'key123'
SET
	bucket1.events = ARRAY_APPEND(IFMISSING(bucket1.events, []), "0001545212572142-46afa5af-42a9-0001"),
	ts_millis = GREATEST(ts_millis, 1545203722000),
	ts = MILLIS_TO_LOCAL(GREATEST(ts_millis, 1545203722000), "1111-11-11 11:11:11"),
	cnt = IFMISSING(cnt, 0)+1,
	is_type = True
 RETURNING *;

.
The max_parallelism is already at “1” as the engine restarted itself multiple times (and that’s the default it goes back to). I changed it to “8” couple days ago, but again, this lasts for 2-3 minutes tops.


#18

Could you please post exact CB version that generated this stack.


#19

Running on Ubuntu, and the GUI says:
You are running Couchbase Server Enterprise Edition 5.5.3 build 4039.

We used
couchbase-server-enterprise_5.5.3-ubuntu16.04_amd64.deb
for installation.

The cbq-start line looks like :

_time=2018-12-19T13:42:08.586-05:00 _level=INFO _msg=cbq-engine started servicers=32 scan-cap=512 version=2.0.0-N1QL max-concurrency=8 request-cap=256 request-size-cap=67108864 max-index-api=3 datastore=http://127.0.0.1:8091 loglevel=INFO timeout=0s plus-servicers=128 pipeline-cap=512 pipeline-batch=16 n1ql_feat_cntrl=0


#20

Hi. Any news on this subject?