CBlite size, bandwidth and replication considerations for 1M docs, 100.000 of them updated every 10 sec

heretyk · October 22, 2015, 2:02pm

Hello,

I am new to database management, and even newer to Couchbase, so I need some guidance.
I have questions about how the CBLite database size behaves and how to manage its growth.

Here is a summary of how my app works :

My app uses continuous push/pull replication to a remote Couchbase Server (development infrastructure).

At the moment, I dynamically assign 4 channels to documents on my Couchbase Sync Gateway through my Sync Function.
All the documents tagged with these channels are readable by, and replicated to, every client/user (that is, every connected CBLite client). Also, each client makes at least 1 revision every 5 seconds to 2 keys (Strings, about 50 characters in total) of a document.
Finally, I also use a LiveQuery, filter my results and display documents on the user’s device.

After some days of performance testing, I started to notice my app slowing down. I started monitoring resources and analyzing my code.
I noticed that the “DATA” size of my app on the device was increasing constantly…

My app package after initial installation is about 25-30 MB.
In two days, it reaches 100MB.

First question:
Is the way my app works correct?

Second question:
In production, we might reach 1.000.000 documents in a few months… Is my CBLite database going to reach a “huge” size then?

Last question:
I know that I have to set up compaction… but I am not sure how to do this efficiently. Can anyone guide me on this?

Data bucket info:

Nodes:             1
Item Count:        457
Ops/sec:           36
Disk Fetches/sec:  0

RAM/Quota Usage:   50.1MB / 1.19GB
Data/Disk Usage:   122MB / 129MB

jens · October 22, 2015, 3:30pm

Are you compacting the database? If you don’t, it will keep growing and not reclaim any unused space.

(In 1.2 this will become automatic, but for now it’s a manual process.)

jens · October 22, 2015, 3:37pm

Each client is creating about 700 revisions per hour? That’s rather a lot if every client replicates to every other: with just 1000 clients, each one would be receiving about 200 revisions per second (1000 clients × 1 revision per 5 seconds), and that rate grows with every client you add. I don’t think most mobile devices could keep up with that as the user count grows; insert rates I’ve seen on iOS peak at maybe 2000/sec. It’s going to use a ton of bandwidth too.

heretyk · October 22, 2015, 4:00pm

I didn’t find the documentation for it. Do you have a link?

Hmm, yes, that is a problem then… I definitely need to update these keys every 5 sec… maybe 10 sec would be acceptable…

jens · October 22, 2015, 7:05pm

Both the Database class docs and the API reference discuss compaction.

http://developer.couchbase.com/documentation/mobile/1.1.0/develop/guides/couchbase-lite/native-api/database/index.html

http://developer.couchbase.com/documentation/mobile/1.1.0/develop/references/couchbase-lite/couchbase-lite/database/database/index.html

jens · October 22, 2015, 7:09pm

Every ten seconds is going to be half as much traffic, which is still impractical IMHO. Do you really need every client to receive all of every other client’s updates in real time?

heretyk · October 22, 2015, 9:11pm

Yes.
In the future, maybe I will modify the channel assignment so that only relevant documents are replicated to users.
But even then, it is possible that we will have something like 10.000 or 100.000 users who should receive every update in less than 1 minute.

What are the options in this scenario?

jens · October 22, 2015, 10:35pm

You’ll need to do some math. Estimate the rate at which an active user produces data, the size of that data, the fraction of the time a user is active, and the number of users. Then basically multiply those together to get the rate at which data will flow to a user. Now compare to the amount of cell bandwidth that’s likely to be available, and the typical user’s cell data cap. Also consider the storage capacity of a device.
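
As a rough sketch of that estimate (not a measurement), the multiplication can be written out as below. The function names and the sample user count and active fraction are hypothetical; the 50-byte payload and the 5-second interval come from the figures earlier in this thread, and real replication traffic will be larger because each revision also carries metadata.

```python
def inbound_bytes_per_second(users, active_fraction, updates_per_second, bytes_per_update):
    """Rate at which data flows to one client if it receives every active user's updates."""
    # Every other user is a potential sender; only a fraction is active at a time.
    active_senders = (users - 1) * active_fraction
    return active_senders * updates_per_second * bytes_per_update


def seconds_until_cap(cap_bytes, bytes_per_second):
    """How long a data cap lasts at a constant inbound rate."""
    return cap_bytes / bytes_per_second


# Assumed inputs: 1,000 users, half of them active, one 50-byte update every 5 s.
rate = inbound_bytes_per_second(
    users=1_000, active_fraction=0.5, updates_per_second=1 / 5, bytes_per_update=50
)
print(f"{rate:.0f} bytes/s flowing to each client")
print(f"a 2 GB cap would last about {seconds_until_cap(2e9, rate) / 3600:.0f} hours")
```

Plugging in the expected user count, payload size and data cap for the real app gives the number to compare against available cell bandwidth.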

What I’m basically saying is that, from what you’ve said here, your system doesn’t sound practical. No matter what technology you use, it seems like the client will be receiving an impractical firehose of data that grows with the number of users. Worse, your servers have to handle a mega-firehose of data that grows as the square of the number of users. (This is a really big problem. Many services with this kind of connectivity have had a terrible time scaling. Twitter certainly did. LiveJournal, the first true social network, had big problems (that eventually crippled it) and invented memcached as a partial solution … and then memcached eventually grew into Couchbase Server.)

I don’t know what this data consists of or is used for, so I don’t have detailed advice. It seems likely, though, that a client doesn’t need to use all of the data produced by other clients. Maybe it only needs aggregate data; the server could do the aggregation and the clients could pull that. Or maybe a client only really needs to know what a small fraction of the other clients are doing.

heretyk · October 24, 2015, 6:32pm

So if my average document size is 1 MB and I have 100.000 users who each make a revision to a key (a String of 50 characters) every 10 seconds:

Bandwidth needed in Mbps: (Total users * Average size of revised document) / 10

100000/10 = 10000 Mbps
So 1250 MBps at peak.

Am I correct?

Yes, it seems pretty ambitious…

heretyk · October 24, 2015, 6:39pm

Still about scaling/sizing… but not about bandwidth. Suppose that in a few months I want to upgrade my 3 current nodes (2 vCPUs, 2 GB) to 3 nodes with another configuration (8 vCPUs, 32 GB, SSD drives).

Can I use the method described in your documentation, “Online upgrade with swap rebalance”?
http://docs.couchbase.com/admin/admin/Install/upgrade-online.html

So basically: add 1 of the new servers to my cluster, put all the data on it, remove the 3 old nodes from the cluster, then add the 2 other “new” servers to the cluster.

If yes, can I do this if my 3 old servers run Couchbase 3.0 and the new ones run 4.0?

heretyk · October 27, 2015, 2:38pm

Hello,

@Jens, can you tell me if I am right?

To give you more information, I update the user’s location every 5 seconds. The key is a String containing his location.
The document is his profile.

Let’s say that at the moment I do need to receive all the updates for the other users. Is there a way to handle this?

jens · October 27, 2015, 4:15pm

A document’s ID / key is immutable; you can’t change it. If you use a different key, that’s a different document.

If you want to track a user’s location, you need a document whose key is based on the user’s ID, and whose value is the location (and nothing else, to avoid transmitting redundant data over and over.)

Let’s say that at the moment i do need to receive all the updates for the others users. Is there a way to handle this ?

I don’t think so. From what you’ve said before, the number of updates flowing through the system would be crazy. Your math wasn’t correct: server bandwidth scales as the square of the number of users. 100k users each broadcasting a message every 10 seconds is a billion messages per second going through the server. That’s Google-scale stuff.
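
To see where that figure comes from, here is a minimal sketch of the fan-out arithmetic, assuming every update is delivered to every other user and nothing is batched or filtered. The function name is made up for illustration.

```python
def deliveries_per_second(users, interval_seconds):
    """Messages the server must deliver per second under full broadcast."""
    updates_per_second = users / interval_seconds  # updates arriving at the server
    return updates_per_second * (users - 1)        # each one goes to every other user


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} users -> {deliveries_per_second(n, 10):.3g} deliveries/sec")
```

Ten times the users means roughly a hundred times the deliveries, which is the quadratic growth described above.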

heretyk · October 27, 2015, 11:56pm

That’s not what I meant… The document itself = a user. One of its keys = his location.
But that’s not the only key other users need to have access to…

You’re right…

Let’s go straight to the point: imagine I have an event like a festival with 100 000 people. Each of these people has 1 document with their location. These 100 000 documents are tagged with the same channel. Each client (phone) updates its location every 10 sec. There is continuous push/pull replication between each client and the server.

Is there no way to control replication, or to segment the data, or any other way to make this doable with Couchbase?

jens · October 28, 2015, 1:33am

Is there no way to control either replication or to segment data or any other way to make it doable with Couchbase ?

I don’t know what to say. You’re fighting against mathematics and information theory here; it has nothing to do with Couchbase specifically.

You said you were new to network programming, IIRC? You may need to spend some time reading or otherwise learning before you try to design something this big.

heretyk · October 28, 2015, 4:02pm

@jens

I am “new” to Android programming and to working with NoSQL, especially Couchbase. I have already designed solutions in another programming language that use sockets and a client-server architecture.
But that’s not the point.

If I open a topic and ask for help here, it is because I didn’t find the relevant information about how to manage such a volume of data through your replication system. I went through your sizing and scaling guide, and also through everything about LiveQueries and the replication system.

Now correct me if I am wrong, but on “Couchbase Server - Modern Cloud-Native, Distributed Database” you are highlighting facts like:

"Big Data Integration : Integrate with Hadoop, Spark, and Kafka to enrich, distribute, and analyze operational data both offline and in real time "
Use cases with eBay, LinkedIn etc… with 100M users, 700M documents.

Given these elements, I think it is understandable that a company like ours, preparing a go-live and fast growth, can ask on your dedicated forum whether an almost real-time replication scenario with 100.000 documents is doable or not.

Now if you can show me a direct link to your documentation saying: " Even if you use network load-balancers, clustering, optimize channels system and use dedicated and clustered top-class servers with redundant 10 Gbit/s SAN with SSD drives, there is NO WAY to support 100.000 users or more each replicating live 1 Document every 10 seconds" …then I could understand your response.

Don’t take it personally; I appreciate the help you have provided in some topics in the past.
But I don’t think your answer is productive or helpful.

I have a meeting with one of your engineers and a sales rep today, as we are going live soon. They already told me that it is possible to handle such an amount of data…

I will let you know.

Regards

jens · October 28, 2015, 5:20pm

Use cases with eBay, LinkedIn etc… with 100M users, 700M documents

It’s not the number of users or documents that’s at issue here, it’s the network bandwidth, especially to mobile devices. Server-side people are used to throwing CPUs and 10-gigabit Ethernet adapters at problems, but there’s nothing you can do to increase the bandwidth of a typical cell connection or to raise a user’s monthly bandwidth cap.

Let’s do the math: If you want a full broadcast between 100K devices every 10 sec, and assuming each update is ~100 bytes, that’s a megabyte/sec to each client. That’s probably feasible over a modern 4G connection, but large areas of the world, even US and Europe, don’t have that. Worse, if a user has a 10GB monthly data cap, as I do, it’s going to eat through it in three hours.

Worse, if this is a scenario like a festival or a conference, everyone’s going to be in the same location on the same network. The total bandwidth is 100k times the above, or 100 gigabytes/sec. You could throw enough network interfaces at your server to handle that, but it’s going to vaporize any cell tower or WiFi network where the users are.
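
The arithmetic in the two paragraphs above can be checked with a few lines. This is only a sketch under the stated assumptions (100 bytes per update, decimal gigabytes, no protocol overhead); the constant names are illustrative.

```python
USERS = 100_000        # devices in the full broadcast
INTERVAL_S = 10        # seconds between updates from each device
UPDATE_BYTES = 100     # assumed size of one update
CAP_BYTES = 10e9       # 10 GB monthly data cap

# Data arriving at a single client per second.
per_client = USERS * UPDATE_BYTES / INTERVAL_S
# Time until that client has used its whole monthly cap.
hours_to_cap = CAP_BYTES / per_client / 3600
# Total downstream traffic if all clients share one venue network.
aggregate = per_client * USERS

print(f"per client: {per_client / 1e6:.1f} MB/s")
print(f"cap used up after: {hours_to_cap:.1f} hours")
print(f"aggregate: {aggregate / 1e9:.0f} GB/s")
```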

That’s what I mean by fighting against math and information theory. Sorry if I sounded patronizing, but none of what you’ve written here implies that you’ve thought through the implications.

What you should be asking is: what information does the user want? A user doesn’t want to know where all 99,999 other users are right this second. They may want a heat-map showing where people are concentrated, and they’re probably only going to look at that heat-map once in a while so it doesn’t need to be pushed to them. They may have a small number of other users they do want to track in real-time, something like a buddy list; you can estimate that’s going to max out at about 150 people (look up “Dunbar Number” for why.) They don’t need location info down to the centimeter, so it’s probably not going to be changing every 10 seconds unless that user is walking around; most of the time the location will stay the same for minutes or even hours.
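
One way to act on the last point is to send a location only when the device has actually moved. A minimal sketch follows, assuming a 25 m threshold and a hypothetical `LocationThrottle` helper; neither is part of Couchbase Lite, and the distance uses the standard haversine formula.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


class LocationThrottle:
    """Decides whether a new fix differs enough from the last sent one to be worth saving."""

    def __init__(self, min_move_m=25.0):
        self.min_move_m = min_move_m  # assumed threshold, tune for the app
        self.last_sent = None

    def should_send(self, lat, lon):
        moved_enough = (
            self.last_sent is None
            or haversine_m(*self.last_sent, lat, lon) >= self.min_move_m
        )
        if moved_enough:
            self.last_sent = (lat, lon)
        return moved_enough


throttle = LocationThrottle()
for fix in [(48.0, 2.0), (48.00001, 2.0), (48.001, 2.0)]:
    print(fix, "send" if throttle.should_send(*fix) else "skip")
```

A stationary user then produces no revisions at all, instead of one every 10 seconds.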

That amount of data is perfectly feasible; in fact it’s exactly the kind of thing IM systems like AIM have been doing for decades at scale.

heretyk · October 28, 2015, 6:44pm

I have been through these considerations, and didn’t find a way to manage the bandwidth.

I am evaluating modifying the code, the global architecture, or the channels. I can’t find the best way to do this.
As you say, each user won’t need all the location updates.
Only those of the 150 users closest to them, for instance.

But what would be the best way to do this… At present I calculate the distance between all users on the device.

How could I make sure that only the updates of the 150 closest users are replicated to a device…?

jens · October 28, 2015, 8:19pm

I’m not sure. A better person to ask would be Volker Mische (@vmx), an engineer on Couchbase Server who’s an expert on geographic databases. Couchbase Server has some geo-query features that will help with this.

I can say that this may not be a good use case for Couchbase Mobile, since the data you’re concerned with is ephemeral and rapidly changing. Couchbase Mobile is optimized for data with more persistence, and the way it tracks document histories for conflict management means that it has more overhead in dealing with changes.

The lightest-weight way to do what you want may be to use Couchbase Server directly, with a REST API to your app-server for the device to post its current location and request proximity updates.
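
A minimal sketch of what such a proximity lookup could compute on the app server, assuming positions are held in an in-memory dict keyed by user ID. This is a brute-force scan with hypothetical names (`nearest_users`, `approx_distance_m`), not a Couchbase geo-query; at 100k users a spatial index would likely be needed instead.

```python
import heapq
import math


def approx_distance_m(a, b):
    """Equirectangular approximation in metres; adequate over festival-sized distances."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6_371_000 * math.hypot(x, y)


def nearest_users(me, positions, k=150):
    """Return the IDs of the k users closest to `me`, nearest first."""
    origin = positions[me]
    others = ((uid, pos) for uid, pos in positions.items() if uid != me)
    closest = heapq.nsmallest(
        k, others, key=lambda item: approx_distance_m(origin, item[1])
    )
    return [uid for uid, _ in closest]


# Made-up positions for illustration.
positions = {
    "a": (48.0, 2.0),
    "b": (48.0001, 2.0),
    "c": (48.01, 2.0),
    "d": (48.001, 2.0),
}
print(nearest_users("a", positions, k=2))
```

The device would then request only this short list, rather than computing distances to every user itself.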

bfwarner · October 29, 2015, 6:46am

Having every user in the system know the location of every other user doesn’t sound feasible. It’s just too much information, and the amount of information grows with the square of the number of users. With Couchbase, rapidly mutating data is expensive because it stores older revisions and doesn’t reclaim the space used by those older revisions until compaction time.

I wonder if you can make some simplifications.

1. You only need to update the document when the user moves; then you don’t have this constant stream of updates when most of the time people aren’t moving. You can’t guarantee a consistent stream of data anyway because of the unreliability of cellular networks.
2. Do you really need everyone to know about everybody else? If it’s just those that are close, why not create channels based on a larger area? Name the channel after longitude and latitude, or zip code; then each device only needs to sync the documents in its geographic area.
3. Use iBeacons to track when people are near: turn the phone into an iBeacon, have everybody use the same UUID, and give everybody a unique major and minor number. Create channels based on the major and minor number, then subscribe to the channel when someone comes in range.
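
The area-based channel idea in the second suggestion could look roughly like this. It is a sketch only: the `cell_` naming scheme and the 0.01 degree cell size (about 1.1 km of latitude) are assumptions, the real channel assignment would live in the Sync Gateway sync function, and the names should be checked against the characters Sync Gateway accepts in channel names.

```python
import math


def cell_index(lat, lon, cell_deg=0.01):
    """Row and column of the grid cell containing a point."""
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)


def cell_channel(lat, lon, cell_deg=0.01):
    """Hypothetical channel name for the grid cell a document's location falls in."""
    row, col = cell_index(lat, lon, cell_deg)
    return f"cell_{row}_{col}"


def neighbour_channels(lat, lon, cell_deg=0.01):
    """The cell a device is in plus the 8 surrounding cells, for it to subscribe to."""
    row, col = cell_index(lat, lon, cell_deg)
    return [f"cell_{row + dr}_{col + dc}" for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


print(cell_channel(48.8566, 2.3522))
print(neighbour_channels(48.8566, 2.3522))
```

Subscribing to the neighbouring cells as well avoids missing users who are close by but just across a cell boundary.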

heretyk · October 30, 2015, 9:00pm

Thanks @bfwarner,

Great ideas. Much appreciated.
Respectfully,

Hello @jens,

We will certainly evolve towards the architecture you described, with a dedicated API, once we reach a certain number of users in 1 or 2 years.

But for the moment, we prefer to use Couchbase Mobile (to handle offline/out-of-network situations and give a better user experience: lower latency, queries that do not depend on bandwidth).

Did you get any feedback from Volker Mische?