We are benchmarking Couchbase and observing some very strange behaviour.
Couchbase cluster machines:
2 x EC2 r3.xlarge with 80 GB General Purpose SSD (not EBS-optimised), IOPS 240/3000.
Data RAM Quota: 22407 MB
Index RAM Quota: 2024 MB
Index Settings: default
Per Node RAM Quota: 22407 MB
Total Bucket Size: 44814 MB (22407 x 2)
Replicas enabled (1)
Disk I/O Optimisation: Low
- Each node runs all three services
Client machine:
1 x EC2 m4.xlarge with 20 GB General Purpose SSD (EBS-optimised), IOPS 60/3000.
The client runs the YCSB benchmark tool.
PS: All the machines are residing within the same VPC and subnet.
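As a sanity check on the sizing above, here is a rough back-of-the-envelope calculation. It assumes the YCSB defaults for workloada (10 fields x 100 bytes, i.e. roughly 1 KB per record, ignoring key and metadata overhead), which is an assumption, not something measured on this cluster:

```python
# Rough check: does the target dataset fit in the bucket's RAM quota?
# Assumes ~1 KB per record (YCSB workloada defaults: 10 fields x 100 bytes);
# key and metadata overhead are ignored, so this underestimates real usage.
record_count = 100_000_000       # recordcount from the YCSB command
record_size_kb = 1               # assumed record size
bucket_quota_mb = 44_814         # Total Bucket Size from the cluster config

dataset_mb = record_count * record_size_kb / 1024
replicated_mb = dataset_mb * 2   # 1 replica roughly doubles the footprint

print(f"dataset: ~{dataset_mb:,.0f} MB, with replica: ~{replicated_mb:,.0f} MB, "
      f"quota: {bucket_quota_mb:,} MB")
```

Under these assumptions the full dataset (with replica) is several times the RAM quota, so heavy ejection and disk pressure would be expected well before the load completes.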
ycsb load couchbase -s -P workloads/workloada -p recordcount=100000000 -p core_workload_insertion_retry_limit=3 -p couchbase.url=http://HOST:8091/pools -p couchbase.bucket=test -threads 20 | tee workloadaLoad.dat
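For context, `core_workload_insertion_retry_limit=3` makes YCSB retry each failed insert up to 3 times before giving up on that record. A minimal sketch of that retry behaviour, in Python with a hypothetical `insert` callable and a stand-in exception type (the real logic lives inside YCSB's CoreWorkload, in Java):

```python
import time

class TemporaryFailure(Exception):
    """Stand-in for the server's 'Temporary failure' response."""

def insert_with_retry(insert, key, value, retry_limit=3, backoff_s=0.1):
    """Retry an insert on temporary failure, mirroring the effect of
    core_workload_insertion_retry_limit=3 (a sketch, not YCSB's actual code)."""
    for attempt in range(retry_limit + 1):
        try:
            return insert(key, value)
        except TemporaryFailure:
            if attempt == retry_limit:
                raise  # out of retries: give up on this record
            time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff
```

Once a thread exhausts its retries, it stops inserting, which is why the load run above terminates early instead of riding out the back-pressure.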
While everything works as expected:
The average ops/sec is ~21000
The ‘disk write queue’ graph fluctuates between 200K and 600K (periodically drained).
The ‘temp OOM per sec’ graph is at constant 0.
When things start to get weird:
After ~27M documents have been inserted, the ‘disk write queue’ rises steadily and is no longer drained.
At a disk queue size of ~8M, the temp OOM failures start to show themselves and the client receives ‘Temporary failure’ from Couchbase.
After 3 retries per YCSB thread, the client stops, having inserted only ~27% of the overall documents.
Even after the YCSB client has stopped, the ‘disk write queue’ only asymptotically approaches 0 and is fully drained after ~15 min.
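The observations above can be turned into a toy model of the disk write queue: items arrive at the observed insert rate and are persisted at a fixed drain rate. The drain rate below is inferred from the ~8M backlog taking ~15 min to empty; it is an assumption, not a measured value:

```python
# Toy model: queue grows while insert rate exceeds drain rate.
insert_rate = 21_000                      # ops/sec observed from YCSB
drain_rate = 8_000_000 / (15 * 60)        # assumed: ~8M items drained in ~15 min

growth = insert_rate - drain_rate         # net queue growth per second under load
print(f"drain rate ~{drain_rate:,.0f} items/s, queue grows ~{growth:,.0f} items/s")

# Time of sustained load needed to build the ~8M backlog at which OOMs appeared:
minutes_to_8m = 8_000_000 / growth / 60
print(f"~{minutes_to_8m:.0f} min of sustained load to build an 8M backlog")
```

Under this model the queue can only grow once persistence stops keeping up, which matches the steady rise we see after ~27M documents.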
When we benchmark locally on a MacBook with 16 GB of RAM and an SSD (local client + single-node server), we do not observe this behaviour; the ‘disk write queue’ is drained constantly and in a predictable manner.