KernTime is too long

Hi.
We are running a 3-node cluster in PROD. A lot of our queries have a long kernTime. For example, this simple INFER:

        {
        "completed_requests": {
          "clientContextID": "INTERNAL-60973a77-7dcc-49e4-8090-eef393288f5e",
          "elapsedTime": "6.226580274s",
          "errorCount": 0,
          "node": "10.1.22.24:8091",
          "phaseOperators": {
            "authorize": 1
          },
          "phaseTimes": {
            "authorize": "38.146085ms",
            "instantiate": "23.285µs",
            "parse": "442.369µs",
            "plan": "10.267µs",
            "run": "6.225899249s"
          },
          "remoteAddr": "10.1.22.24:43639",
          "requestId": "12e8556a-2a41-4e43-9e6e-d013b8912e2b",
          "requestTime": "2020-06-13 01:40:25.853717719 +0000 UTC",
          "resultCount": 1,
          "resultSize": 57083,
          "scanConsistency": "unbounded",
          "serviceTime": "6.226388805s",
          "state": "completed",
          "statement": "infer `collections`  with {\"infer_timeout\":5};",
          "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36 (Couchbase Query Workbench (6.0.1-2037-enterprise))",
          "users": "Administrator"
        },
        "plan": {
          "#operator": "Sequence",
          "#stats": {
            "#phaseSwitches": 1,
            "execTime": "5.803µs"
          },
          "~children": [
            {
              "#operator": "Authorize",
              "#stats": {
                "#phaseSwitches": 3,
                "execTime": "7.078µs",
                "servTime": "38.139007ms"
              },
              "privileges": {
                "List": [
                  {
                    "Priv": 7,
                    "Target": ":collections"
                  }
                ]
              },
              "~child": {
                "#operator": "InferKeyspace",
                "#stats": {
                  "#itemsOut": 1,
                  "#phaseSwitches": 7,
                  "execTime": "13.675µs",
                  "kernTime": "6.18731965s"
                },
                "keyspace": "collections",
                "namespace": "default",
                "using": "default",
                "with": {
                  "infer_timeout": 5
                }
              }
            },
            {
              "#operator": "Stream",
              "#stats": {
                "#itemsIn": 1,
                "#itemsOut": 1,
                "#phaseSwitches": 7,
                "execTime": "6.071µs",
                "kernTime": "6.225515333s"
              }
            }
          ],
          "~versions": [
            "2.0.0-N1QL",
            "6.0.1-2037-enterprise"
          ]
        }
      }

What could be the cause?

It looks like CPUs are busy. https://docs.couchbase.com/server/5.0/monitoring/monitoring-n1ql-query.html#monitor-profile-details

Yes, that was my thought. But it looks like CPU utilization doesn’t exceed 30%. This is a one-week chart across all nodes.

As Sitaram says, high kernel time on its own is due to the kernel not having enough resources to schedule goroutines.

I think what we have to do here is measure cbq-engine CPU time at a much finer granularity, and get the CPU usage at the moment your requests are taking longer than expected.
Is it just infer, or do you have any other example?
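
Something along these lines would cover both, assuming Linux nodes with the sysstat package installed and the default query port (8093); process names and credentials below are placeholders you will need to adjust:

    # 1) Sample cbq-engine CPU once a second for a minute while a slow query runs
    #    (pidstat comes from sysstat; the process name is assumed to be cbq-engine).
    pidstat -u -p "$(pidof cbq-engine)" 1 60

    # 2) Pull the slowest recent requests from system:completed_requests through
    #    the query REST API; STR_TO_DURATION turns the elapsedTime string into
    #    nanoseconds so the results sort numerically.
    curl -s -u Administrator:password http://localhost:8093/query/service \
      --data-urlencode 'statement=SELECT requestTime, elapsedTime, serviceTime, statement
                                    FROM system:completed_requests
                                   ORDER BY STR_TO_DURATION(elapsedTime) DESC
                                   LIMIT 10'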

Hi @Marco_Greco. Thank you for the link, I will check it out.
Actually, it happens with almost every query; I just took INFER as a simple example. Here is another one. Also, this one has operators whose state is still “running” O_0 (a quick way to double-check that is sketched after the profile below). I can’t understand what’s wrong and why it happens.

"completed_requests": {
      "clientContextID": "5::6",
      "elapsedTime": "26.9284669s",
      "errorCount": 0,
      "node": "10.1.22.25:8091",
      "phaseCounts": {
        "fetch": 91504,
        "indexScan": 92011
      },
      "phaseOperators": {
        "authorize": 1,
        "fetch": 1,
        "indexScan": 1,
        "update": 1
      },
      "phaseTimes": {
        "authorize": "47.703µs",
        "fetch": "26.848636211s",
        "indexScan": "529.958694ms",
        "instantiate": "58.398µs",
        "parse": "1.695583ms",
        "plan": "2.907803ms",
        "run": "26.923624808s",
        "update": "20.364µs"
      },
      "remoteAddr": "10.1.245.104:40553",
      "requestId": "62d9a843-722c-4a7a-a605-a40e28304cf5",
      "requestTime": "2020-06-19 23:27:27.634976443 +0000 UTC",
      "resultCount": 0,
      "resultSize": 0,
      "scanConsistency": "unbounded",
      "serviceTime": "26.928296554s",
      "state": "closed",
      "statement": "UPDATE assets\n                            SET assets.ancestry = $ancestry,\n                                assets.workspaceId = $workspaceId,\n                                assets.modifiedOn = CLOCK_UTC()\n                            WHERE assets.type = 'asset'\n                              AND assets.folderId = $folderId",
      "userAgent": "couchbase-net-sdk/2.7.16.1 (clr/.NET Core 4.6.28325.0) (os/Linux 4.14.177-104.253.amzn2.x86_64 #1 SMP Fri May 1 02:01:13 UTC 2020)",
      "users": "local:Administrator,Administrator,Administrator"
    },
    "plan": {
      "#operator": "Sequence",
      "#stats": {
        "#phaseSwitches": 2,
        "execTime": "65h21m59.0765914s",
        "kernTime": "289ns",
        "state": "running"
      },
      "~children": [
        {
          "#operator": "Authorize",
          "#stats": {
            "#phaseSwitches": 4,
            "execTime": "65h21m59.076599968s",
            "kernTime": "236ns",
            "servTime": "45.943µs",
            "state": "running"
          },
          "privileges": {
            "List": [
              {
                "Priv": 8,
                "Target": "default:assets"
              }
            ]
          },
          "~child": {
            "#operator": "Sequence",
            "#stats": {
              "#phaseSwitches": 2,
              "execTime": "65h21m59.076612323s",
              "kernTime": "154ns",
              "state": "running"
            },
            "~children": [
              {
                "#operator": "IndexScan3",
                "#stats": {
                  "#itemsOut": 92017,
                  "#phaseSwitches": 368073,
                  "execTime": "428.187584ms",
                  "kernTime": "26.394264759s",
                  "servTime": "101.937237ms"
                },
                "index": "asset-all",
                "index_id": "7b1d8e0ad4505645",
                "index_projection": {
                  "primary_key": true
                },
                "keyspace": "assets",
                "namespace": "default",
                "spans": [
                  {
                    "exact": true,
                    "range": [
                      {
                        "high": "\"asset\"",
                        "inclusion": 3,
                        "low": "\"asset\""
                      }
                    ]
                  }
                ],
                "using": "gsi"
              },
              {
                "#operator": "Sequence",
                "#stats": {
                  "#phaseSwitches": 4,
                  "execTime": "65h21m59.076664779s",
                  "kernTime": "429ns",
                  "state": "running"
                },
                "~children": [
                  {
                    "#operator": "Fetch",
                    "#stats": {
                      "#itemsIn": 91505,
                      "#itemsOut": 91488,
                      "#phaseSwitches": 377432,
                      "execTime": "579.786612ms",
                      "kernTime": "75.596092ms",
                      "servTime": "26.270618713s"
                    },
                    "keyspace": "assets",
                    "namespace": "default"
                  },
                  {
                    "#operator": "Filter",
                    "#stats": {
                      "#itemsIn": 91488,
                      "#itemsOut": 3,
                      "#phaseSwitches": 182985,
                      "execTime": "2.599371351s",
                      "kernTime": "24.32591738s"
                    },
                    "condition": "(((`assets`.`type`) = \"asset\") and ((`assets`.`folderId`) = $folderId))"
                  },
                  {
                    "#operator": "Clone",
                    "#stats": {
                      "#itemsIn": 3,
                      "#itemsOut": 3,
                      "#phaseSwitches": 17,
                      "execTime": "1.389894ms",
                      "kernTime": "26.923904079s"
                    }
                  },
                  {
                    "#operator": "Set",
                    "#stats": {
                      "#itemsIn": 3,
                      "#itemsOut": 3,
                      "#phaseSwitches": 15,
                      "execTime": "1.168129ms",
                      "kernTime": "26.924135502s"
                    },
                    "set_terms": [
                      {
                        "path": "(`assets`.`ancestry`)",
                        "value": "$ancestry"
                      },
                      {
                        "path": "(`assets`.`workspaceId`)",
                        "value": "$workspaceId"
                      },
                      {
                        "path": "(`assets`.`modifiedOn`)",
                        "value": "clock_utc()"
                      }
                    ]
                  },
                  {
                    "#operator": "SendUpdate",
                    "#stats": {
                      "#itemsIn": 3,
                      "#phaseSwitches": 13,
                      "execTime": "1.158867ms",
                      "kernTime": "26.924241649s",
                      "servTime": "2.85755ms"
                    },
                    "alias": "assets",
                    "keyspace": "assets",
                    "namespace": "default"
                  }
                ]
              },
              {
                "#operator": "Discard",
                "#stats": {
                  "#phaseSwitches": 3,
                  "execTime": "1.114044ms",
                  "kernTime": "26.9242681s"
                }
              }
            ]
          }
        },
        {
          "#operator": "Stream",
          "#stats": {
            "#phaseSwitches": 3,
            "execTime": "4.36603ms",
            "kernTime": "26.924322875s"
          }
        }
      ],
      "~versions": [
        "2.0.0-N1QL",
        "6.0.1-2037-enterprise"
      ]
    }
  }
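
A quick way to double-check whether a request like this is really still executing is to look in system:active_requests; a finished request only appears in system:completed_requests. A sketch, with placeholder credentials:

    # List whatever the query service is still executing right now.
    curl -s -u Administrator:password http://localhost:8093/query/service \
      --data-urlencode 'statement=SELECT requestId, state, elapsedTime, statement
                                    FROM system:active_requests'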

Your node is definitely overloaded - the Filter operator has an execution time of 2.5 seconds for 90K documents, but what’s really striking is that your initial Sequence seems to be in a very bad way, running for 65 hours?
Could you collect stacks and attach them?

@Marco_Greco But why, then, is CPU utilization under 30%? Excuse me, what are stacks?

Sorry - I missed one crucial bit: your servTime for the Fetch is 26 seconds.
That indicates that you have a suboptimal access path, and you are having to fetch a lot of documents only to update 3.
Also, as a guideline, on 6.0 N1QL is able to fetch 2k documents per CPU, and each fetch would normally be sub-millisecond, so 26 seconds for 90K documents seems to indicate that your KV nodes are struggling.
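
Purely as an illustration (the index name and keys below are guesses based on the predicate in your statement, not something that exists on your cluster), an index on the filter fields would let the planner go straight to the 3 matching documents instead of fetching 90K of them:

    # Hypothetical partial index matching: type = 'asset' AND folderId = $folderId,
    # created through the query REST API (placeholder credentials).
    curl -s -u Administrator:password http://localhost:8093/query/service \
      --data-urlencode 'statement=CREATE INDEX ix_assets_folder
                                      ON assets(folderId)
                                   WHERE type = "asset"'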

Just to give you another idea: the Clone operator is taking 1.4ms to make a copy of 3 documents (about 0.5ms apiece), and the Set operator 1.15ms to change 3 documents; again, that’s quite excessive.
I would have a look at the CPU time taken by cbq-engine, and the CPU time on the overall node, while you run a query that you know runs for a long time.
It is likely that you’ll find that the load on your node is higher than you think.

Re stacks - you can collect a set of goroutine stacks on a running cbq-engine; that’s always a good thing to look at for performance problems. I kind of assumed that you had dealt with support before and knew how to collect them.
That said, now that I have spotted the high service time in the Fetch operator, I think there’s no need to collect them.
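
For future reference, should you ever need them: a goroutine dump can usually be pulled from the Go pprof handler on the query service’s admin port. The exact endpoint below is an assumption about your build, so verify it exists before relying on it:

    # Assumed pprof endpoint on the query port (8093); credentials are placeholders.
    # debug=2 asks for full goroutine stacks.
    curl -s -u Administrator:password \
      'http://localhost:8093/debug/pprof/goroutine?debug=2' > cbq-engine-stacks.txt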

@Marco_Greco Sorry, but doesn’t servTime mean the phase is waiting for another service to complete? In this case it waits for the IndexScan, which spent all this time in kernTime.

How would I do it?

Please explain how to do it.

Re servTime: as you say, it is the time spent waiting for other services - for index scan operators it’s the time spent waiting for keys from the indexing service.
For the Fetch operator, it’s the time spent waiting for documents from the key-value store, hence my suggestion that the KV nodes might be struggling.
It is in fact the other way round: all of the other operators’ kernel time is spent waiting for the Fetch operator, which in turn is waiting for the KV to provide documents.

Re monitoring the CPU consumption of your node: the “top” OS command will show what’s running, sorted by CPU load; look out for the cbq-engine and memcached processes.
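
For example, with a reasonably recent procps-ng top (the exact options vary between top versions, so treat this as a sketch):

    # One-shot snapshot, sorted by CPU usage:
    top -b -n 1 -o %CPU | head -n 20

    # Or watch just the query and data service processes:
    top -p "$(pidof cbq-engine)" -p "$(pidof memcached)"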
Re collecting operator stacks - it’s not needed, for now.

@Marco_Greco Thanks. Here is the top output. I can’t see anything odd. How could I check KV service health?