WHEN? started using MongoDB
My MongoDB experience started last July, when the MySQL database of a core system at the company I was working for went critical. The data volume was growing so fast that some analytical processes could no longer finish in a reasonable timeframe, and some of them took more than a day to complete. As you might expect, this was mainly due to the normalized data and some crazy joins. In the past, solutions ranging from SQL tuning and partitioning to sharding and vertical scaling had been adopted to ease the issue, but we all knew they were not going to last long. Sharding in particular requires extra effort from everyone, from Ops to Programming; ordering 100+ GB of RAM for a machine is okay, but it’s expensive and limited after all. That’s when I began seeking alternatives to MySQL.
WHY? picked MongoDB
The data volume we had was large (I guess?), and NoSQL was so popular that everyone claimed it was the solution to BIG DATA, so I began digging into the various choices available (e.g. Redis, Cassandra). Out of them all, MongoDB was the one that best matched the following needs.
Access Pattern. I was dealing with a user-related table. By its nature, changes to user content were rare and the entire user object was often fetched at once, which matches the concept of a document: each user object is a document.
Scalability & Maintenance. It gave me a way to scale horizontally while keeping system administration relatively light (at our scale, and with a small team). E.g. setting up a MongoDB cluster takes just 5-10 minutes, and adding/removing shards is just a few commands away – reference.
Community Support. It had only been available since 2009, and NoSQL was a new trend, so community support was extremely important, especially while the product was still quite young and immature. Its community base is large, and the official forums offered lots of solutions, practices, and FAQs. Often, their CTO would reply in person to questions asked there.
SQL & MapReduce. Better than most NoSQL solutions, it offers a simple SQL-like query layer and MapReduce on top of the dataset, so queries ranging from simple aggregation to complex calculation can be performed. (More importantly, queries are distributed, so they can run in parallel.)
NO SPOF. A Single Point of Failure is disastrous and a “no-no” for a large system. The replica set + sharding design it offers, when combined with careful placement of servers in multiple data centers, can keep the system from failing, or at least make failures rare.
After some testing and planning, MongoDB quickly became the preferred choice for the system storage redesign.
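To make the MapReduce point a bit more concrete, here is a minimal sketch in plain Python of how a simple aggregation can run on each shard separately and then be merged. It is not MongoDB's API (MongoDB's own MapReduce takes JavaScript functions); the shard contents, the `country` field, and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical data: each inner list stands for the user documents on one shard.
SHARDS = [
    [{"country": "HK"}, {"country": "US"}, {"country": "HK"}],
    [{"country": "US"}, {"country": "JP"}],
]

def map_user(doc):
    """Map step: emit one (key, value) pair per document."""
    yield doc["country"], 1

def reduce_counts(key, values):
    """Reduce step: sums values, so it can also merge its own partial results."""
    return sum(values)

def run_on_shard(docs):
    """Run map + reduce locally on the documents of one shard."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_user(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}

def merge(partials):
    """Re-reduce the per-shard results into the final answer."""
    grouped = defaultdict(list)
    for partial in partials:
        for key, value in partial.items():
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}

def users_per_country(shards):
    # Each shard could be processed in parallel; here they run one after another.
    return merge(run_on_shard(docs) for docs in shards)
```

The important detail is that the reduce function can be applied again to its own output, which is what allows the per-shard results to be combined at the end.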
In general, the pros and cons of MongoDB are as follows:
The + Side
- Simple to setup
- Auto-failover
- Horizontal Scaling (cheap commodity hardware)
- Schemaless
- Able to perform simple MapReduce
- Indexes available
- Capped Collections
- Good Community Support
The - Side
- Global Write Lock
- RAM Size > Data Size + Index Size preferred
- Heavy updates require compaction once in a while
- Schemaless
- No built-in secondary index
- Lack of BI tools
- Bugs around
HOW? I used MongoDB
- The user data was imported into MongoDB, with each user corresponding to a document.
- Because the user data amounted to nearly a billion records, pre-splitting was performed before the import started.
- An internal unique id was chosen as the shard key; it was effectively random and discrete most of the time, so that writes/reads could be distributed.
- Embedded documents were used to store the list of activities related to a user object.
- Abbreviated yet recognizable names were used for keys (or you could use arbitrary short names, as long as you document them properly). This reduces the data and index size because keys, even though they are the same in every document, are stored repeatedly, once per document (the price of a schemaless design).
- A few (composite) indexes were carefully deployed on the data set according to the use cases, with direction (ASC/DESC) considered as well.
- Once the data set was ready, the server was put into production, and constant monitoring was done through different admin tools, including the internal web admin and mongostat.
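As a rough illustration of the document layout and the short key names described above, here is a minimal Python sketch. The field names, the abbreviations, and the sample values are all hypothetical, and the JSON length is only a stand-in for the real on-disk (BSON) size, so treat the numbers as a relative comparison at best.

```python
import json

# Hypothetical mapping from readable field names to abbreviated keys.
# A table like this should be documented, since only the short names are stored.
SHORT_KEYS = {
    "user_id": "uid",
    "display_name": "nm",
    "activities": "acts",
    "activity_type": "t",
    "created_at": "ts",
}

def shorten(value):
    """Recursively rename keys in dicts, including embedded documents and lists."""
    if isinstance(value, dict):
        return {SHORT_KEYS.get(k, k): shorten(v) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten(item) for item in value]
    return value

# One user = one document, with the related activities embedded as a list.
user = {
    "user_id": 982451653,
    "display_name": "alice",
    "activities": [
        {"activity_type": "login", "created_at": 1325376000},
        {"activity_type": "purchase", "created_at": 1325462400},
    ],
}

short_user = shorten(user)

# Serialized length as a rough proxy for storage size (real BSON sizes differ).
long_size = len(json.dumps(user))
short_size = len(json.dumps(short_user))
print(long_size, short_size)
```

The saving per document is small, but since the key names are repeated in every document, it adds up over a very large collection.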
Some Tips
- When importing a large data set into MongoDB, remember to do “pre-splitting”; otherwise the import will be extremely slow.
- Once in a while, compact the data set on each server (especially if it’s update heavy).
- A Replica Set is highly preferred over a Master-Slave design.
- Spend enough time deciding on a proper key for sharding, as it affects how often the balancer is triggered.
- Denormalize the data enough that it is available in each collection that needs it.
- Make sure RAM > Data + Index, as it greatly affects query/read performance.
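The idea behind pre-splitting is to define the chunk boundaries up front and spread the still-empty chunks over the shards, so that the import writes to all shards from the start. Below is a minimal Python sketch of how such a plan could be computed. It assumes an integer shard key that is uniformly distributed over a known range; the range, the chunk count, and the shard names are made up. Applying the plan to a real cluster (e.g. with MongoDB's split and moveChunk commands) is not shown.

```python
def split_points(key_min, key_max, num_chunks):
    """Return the num_chunks - 1 boundaries that cut [key_min, key_max) into equal ranges."""
    if num_chunks < 1 or key_max <= key_min:
        raise ValueError("need num_chunks >= 1 and key_max > key_min")
    span = key_max - key_min
    return [key_min + span * i // num_chunks for i in range(1, num_chunks)]

def assign_chunks(points, key_min, key_max, shards):
    """Pair each chunk range with a shard, round-robin, so every shard gets a share."""
    bounds = [key_min] + points + [key_max]
    return [
        (bounds[i], bounds[i + 1], shards[i % len(shards)])
        for i in range(len(bounds) - 1)
    ]

# Hypothetical setup: 32-bit integer ids, 8 chunks, 4 shards.
points = split_points(0, 2**32, 8)
plan = assign_chunks(points, 0, 2**32, ["shard0", "shard1", "shard2", "shard3"])

for low, high, shard in plan:
    print(f"[{low}, {high}) -> {shard}")
```

If the shard key is not uniformly distributed, evenly spaced boundaries will produce uneven chunks, and the boundaries would have to be derived from a sample of the real keys instead.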
Conclusion
MongoDB is an amazing product which can bring you loads of benefits if used right. Nevertheless, it’s just a few years old, so some bugs should still be expected. Just make sure enough testing is done before going live, and hang around the support forum! :)