EDIT late 2016, This article is written for Mongo 2.6. If you are technical enough to read this: beg, borrow steal, do whatever it takes to get up to mongo3.2

Please know that: The results of adding too many rows is that it runs slower, not that anything explodes. The business website could run abit slow for a few days until a new server is installed, if necessary. There are eventually hard limits to row counts, but by the time a company is that big, it has a lot of income, and can buy things. Whilst 3rd party articles are useful as they cover many hours work; their use is diminished by the age of the article.

EDIT late 2017. You now need Mongo 3.4.8 or greater. Features for aggregates are improved, again. When I get time I will write a new edition for current times.

For comparison, MySQL innodb is getting abit slow, for about 70%/30% mixed read/writes, on a population of a million rows. In peak time the average query time spikes to 200ms, due to lock stacking. My situation was continual activity with 50 clients concurrently (i.e. up to 50 queries may happen at any point, with an average case of about 30ms, spiking to 200ms). Newer hardware would reduce that abit, but these where network heavy rack boxes. The word innodb is important, as this supports row level locking.

What matters for large datasets to be used in continuous concurrent access, is multi-dimensional. We need to be able to

  • store the data,
  • and access it quickly,
  • and concurrently.

The simple limit for items in a single collection to 2 ^ 32, or abit more than four billion, unless you use wired tiger project. Mongo tries to keep the whole index in RAM. In practical terms that hard limit should never be reached, due to the operational cost of running the database (see RAM). If the data is splitable across multiple collections, that limit could be ignored. An imaginary machine could have four billion items per collection, but matched with a simple hash to change the collection, that limit can be extended another 24000 times 1.

Data modeling

We don't need to hold the entire dataset in RAM, but I am simplifying the technology, as this is purely to get a clearer vision on data volumes. Text strings in the JS used for Mongo are UTF8 by default. I have simplified the storage cost of the wchar to 2bytes. Most of the data under current observation is in ASCII, but an UTF8 char may be 4bytes long. This is a simple model, and I only spent a few hours on it (I have just completed an economics book, where I was reading summaries of peoples PhD theses). Model 1 If we model a document (one item in the above collection) having about a 100 data points, and up to 255 wchar per point; this means each document has raw RAM cost of about 50KB. To make the document work as a item in the collection, model 50% increase for indexes etc. It would be prudent to model a further 50% for everything the operating systems needs to do (network IO for example). This simple guess means realistic upper bound is 10 documents per MB of RAM. To store that massive collection would cost 419,430 GB of RAM. There is no motherboard on the planet that can hold that much RAM. A less verbose item, with an average wchar length of 20 would cost 4K per item. That means 256 items per MB of RAM. To store the maximum collection size would cost 16,384GB of RAM. 20chars would be enough to store an average venue name, an average persons name, an average email address (each item in the collection could be different, but the mean-average would be about that).

Back in the real world, using the same numbers, a 2.5 * 10^6 collection would take 10GB of RAM. As an average spread of data will be also recording numbers (fixed at 8bytes), and some bools (either 8bytes as its actually a number or 1byte if its a separate type), the documents per MB figure is probably too small (so 4times it, leading to 10 * 10^6 item collection). This is a simple model, everything should be measured.

It should be noted that the venue density is unlikely to climb above one per hundred people, as venues have to have customers, and most people don't eat out everyday. I think that 1 per 200 is a more realistic ratio. This paragraph is my opinion, but isn't the focus of the model. Therefore the UK isn't likely to hold more than 1*10^6 venues in total.

To start a new model (model2), it is necessary for the entire index to be held in RAM. For any existing data, we may poll Mongo via db.collection.totalIndexSize() 2, or db.collection.stats().indexSizes 3 for more detail. If you are hazy about what indexes are, here is supplementary reading 4. For new indexes, there is models already 5. To reproduce here “2 * [ n * ( 18 bytes overhead + avg size of indexed field + 5 or so bytes of conversion fudge factor ) ]" (the same reference states that this is likely to be a factor of two too large, I think that analysis is dependant on the input data. This means one needs to measure everything again.). To quote that back, comes to about 400MB for an average text field index.

These models don't mention how-to access the data.

Some real data

There is a fairly thorough report 6 on scaling mixed read and writes for several databases. This is volume of operations, not volume of data. Every NoSQL platform was faster than the above MySQL, but this is newer hardware. Skim down the article and look at the graphs, Mongo is in purple. To express this from a user perspective, at the point we have 13,000 people browsing concurrently (i.e. per minute), we need more AWS boxes, or average page render will be in multiple seconds. For a standard AWS contract, would definitely need more network bandwidth. This report is still the older version of Mongo than current.

To supply opinions for “how to architect a billion item collection”, please read official docs 7 or 8, with more technical details as 9. This was by the guy running Graigs list servers.

Mongo has better sharding capacity than earlier databases. I reference the official Mongo docs 10. This is how to use larger datasets than will realistically fit inside a single box. Sharded datasets have management complexities some of which is discussed in 11. MySQL uses the term “replication”, Mongo has “replica sets”, documented 12.

There is a standardised tool ~ Yahoo Cloud Serving Benchmark (YCSB) 13 14 ~ which can be installed if you need a more detailed survey. The project was started in 2010, but is still under active development. This is has been reviewed 15 quite thoroughly.

For Mongo writes, using SSD is really useful, there are more details 16. This would also be true for every other database.