MONGODB
WAREHOUSE AND AGGREGATOR OF EVENTS
Kyiv Big Data & BI User Group
May 14, 2015
INTRO
Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate
@wikipedia
Small Data is when is fit in RAM
Big Data is when is crash because is not fit in RAM
@devops_borat
DESIGNATION
Collect, aggregate and store events from different sources
Provide load balancing, failover and disaster recovery within
geographically distributed infrastructure
CONDITIONS
Constantly growing event rate
Random intensive access with strict response time (OLTP)
Strict retention period
Existing infrastructure
WHERE IS BIGDATA?
Huge number and variety of event sources
Events are concentrated in "one place"
Query response time is strictly limited
Returned data must be fully consistent
SOLUTIONS
E-L-K SOLUTION
Events → LogStash → ElasticSearch → Kibana
PLUS
M-L-F SOLUTION
Events → LogStash → MongoDB → Flask (REST API)
COMPARISON
ELASTICSEARCH VS. MONGODB
Search Engine              | Document Store
Java                       | C++
9+ supported languages     | 25+ supported languages (R as one of them)
–                          | Server-side scripting
RESTful API/JSON API       | –
–                          | MapReduce
–                          | Security features
ELASTICSEARCH VS. MONGODB
Number of shards defined at        | Shards can be added dynamically
index creation                     |
Replicas synchronized with the     | Secondaries synchronized with the
primary node                       | primary node
Replicas can be used for data      | Secondaries can be used for data
retrieval                          | retrieval
DECISION
ElasticSearch is a search engine, while MongoDB is a document
store, which is more applicable here
Custom REST API is required
Easier infrastructure integration for MongoDB
Overhead in rebuilding indexes on ElasticSearch due to
inserts/removes
MongoDB can connect with ElasticSearch for full-featured text
search if required
OVERVIEW
MONGODB
UPTIME
Availability %               Downtime/year   Downtime/month   Downtime/week
90% ("one nine")             36.5 d          72 h             16.8 h
95%                          18.25 d         36 h             8.4 h
99.999% ("five nines")       5.26 m          25.9 s           6.05 s
99.9999% ("six nines")       31.5 s          2.59 s           604.8 ms
99.9999999% ("nine nines")   31.5569 ms      2.6297 ms        0.6048 ms
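The figures above follow directly from the availability percentage. A minimal sketch of the arithmetic (assuming a 365.25-day year and a 30-day month, which reproduces the table's values):

```python
# Seconds in each reporting period (365.25-day year, 30-day month).
SECONDS = {"year": 365.25 * 24 * 3600, "month": 30 * 24 * 3600, "week": 7 * 24 * 3600}

def downtime_seconds(availability_pct, period):
    """Seconds of allowed downtime per period at a given availability %."""
    return (1 - availability_pct / 100) * SECONDS[period]

# Five nines: roughly 5.26 minutes of downtime per year,
# about 6.05 seconds per week.
print(downtime_seconds(99.999, "year") / 60)
print(downtime_seconds(99.999, "week"))
```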
MONGODB CLUSTER
DATA DISTRIBUTION
* Purpose of Sharding
RANGE BASED SHARDING
MongoDB divides the data set into ranges determined by the
shard key values to provide range based partitioning.
* Range Based Sharding
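The routing idea behind range-based sharding can be sketched with a sorted list of split points; the split values and shard names below are hypothetical, not MongoDB internals:

```python
import bisect

# Hypothetical chunk boundaries on a numeric shard key; each chunk
# covers [lower, upper) and is assigned to one shard.
split_points = [100, 200, 300]
chunk_shards = ["s0", "s1", "s2", "s3"]  # one shard per chunk

def chunk_for(key):
    """Route a document by locating its shard-key value in the ranges."""
    return chunk_shards[bisect.bisect_right(split_points, key)]

print(chunk_for(42), chunk_for(250))  # s0 s2
```

Range-based routing keeps nearby key values together, which makes range queries efficient but can hot-spot a single shard under monotonically increasing keys.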
HASH BASED SHARDING
MongoDB computes a hash of a field’s value, and then uses these
hashes to create chunks.
* Hash Based Sharding
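A rough analogue of the hashed shard key, assuming an MD5-based 64-bit hash (a simplification: MongoDB hashes the BSON representation of the field, not a UTF-8 string):

```python
import hashlib
import struct

def hashed_key(value):
    """MD5 the value and truncate to a signed 64-bit integer."""
    digest = hashlib.md5(str(value).encode()).digest()
    return struct.unpack("<q", digest[:8])[0]

# Adjacent key values map to unrelated hashes, so chunks built over the
# hash space spread monotonically increasing keys evenly across shards.
print(hashed_key(1), hashed_key(2))
```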
HIGH AVAILABILITY
* Primary with Two Secondary Members
HIGH AVAILABILITY
* Primary with Two Secondary Members
HIGH AVAILABILITY
Members   Majority to Elect a New Primary   Fault Tolerance
3         2                                 1
4         3                                 1
5         3                                 2
6         4                                 2
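The table follows from simple majority arithmetic, which is why even member counts add no fault tolerance:

```python
def election_facts(members):
    """Majority needed to elect a primary, and how many members can fail
    while a majority still remains (fault tolerance)."""
    majority = members // 2 + 1
    return majority, members - majority

for n in (3, 4, 5, 6):
    print(n, *election_facts(n))
```

Going from 3 to 4 members raises the majority from 2 to 3 but leaves fault tolerance at 1, so odd-sized replica sets are the economical choice.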
ESTIMATION
WORKING SET
50 events per second and 0.5KB each
Retention period is 90 days
Index factor is 40%
Backup factor is 50%
(affects disk size only)
WORKING SET
273 GB for 90 days
(500 B * 50 events/s * 90 * 24 * 60 * 60 s = 194.4 GB, plus 40% for indexes)
91 GB for 30 days
46 GB for 15 days
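The working-set figures can be reproduced with the stated inputs (50 events/s, 0.5 KB each, 40% index factor; sizes rounded up to whole GB, taking 1 GB = 10^9 bytes):

```python
import math

def working_set_gb(events_per_sec=50, event_kb=0.5, days=90, index_factor=0.4):
    """Raw event bytes over the retention period, plus the index overhead."""
    raw_bytes = events_per_sec * event_kb * 1000 * days * 86400
    return math.ceil(raw_bytes * (1 + index_factor) / 1e9)

print([working_set_gb(days=d) for d in (90, 30, 15)])  # [273, 91, 46]
```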
DATA IN RAM
MongoDB tries to keep data in RAM (especially indexes)
For events it is hard to predict which data will be requested.
The only safe assumption: older events will be in less demand.
RAM & SHARDS
RAM     90 days (273 GB)   30 days (91 GB)   15 days (46 GB)
8 GB    35 shards          12 shards         5 shards
16 GB   18 shards          6 shards          3 shards
32 GB   9 shards           3 shards          2 shards
64 GB   5 shards           2 shards          1 shard
RAM & SERVERS
Days   8 GB   16 GB   32 GB   64 GB
90     175    90      45      25
15     25     15      10      5
30     60     30      15      10
* for 5-member Replica Sets
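Both tables reduce to two formulas: enough shards that each shard's slice of the working set fits in RAM, times five servers per shard for the 5-member replica sets. A sketch (the deck rounds one or two cells slightly differently, e.g. 46 GB on 8 GB nodes):

```python
import math

REPLICA_SET_SIZE = 5  # chosen in this deck for DR and failover

def shards_needed(working_set_gb, ram_gb):
    """Shards so each holds a working-set slice that fits in RAM (approx.)."""
    return math.ceil(working_set_gb / ram_gb)

def servers_needed(shards):
    """Data servers: every shard is a full replica set."""
    return shards * REPLICA_SET_SIZE

# 90-day working set on 16 GB nodes: 18 shards, 90 servers.
print(shards_needed(273, 16), servers_needed(shards_needed(273, 16)))  # 18 90
```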
RAM & SHARDS
Shards process queries in parallel
Each shard costs 3+ servers
More RAM means fewer shards
GOLDEN MEAN
5-member Replica Sets: disaster recovery and failover
30 days of most recent events: the latest events are in the highest demand
16 GB RAM servers: infrastructure limitation
30 data servers: a lot of servers, but we have to pay the price ...
PERFORMANCE
DISK IO & RAM
4 GB RAM, 3 nodes
EVENTS LIFE CYCLE
EVENTS FLOW
Received (LogStash)
Buffered (Redis)
Modified (LogStash / MongoDB)
Stored (MongoDB)
Requested (User / REST API)
Processed (REST API / MongoDB)
Returned (REST API)
MUTATIONS
Done by LogStash
1. Inputs (rabbitmq, network, syslog, etc.)
2. Codecs (json, multiline, etc.)
3. Filters (json, csv, drop, etc.)
4. Outputs (mongodb, elasticsearch, email, file, etc.)
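The four LogStash stages above can be sketched as plain functions; the `debug` field and the list standing in for the mongodb output are assumptions for illustration, not LogStash APIs:

```python
import json

def codec_json(raw):
    """Codec stage: decode a raw line into an event dict (json codec analogue)."""
    return json.loads(raw)

def filter_drop(event, field="debug"):
    """Filter stage: drop events carrying a given field (drop filter analogue)."""
    return None if field in event else event

def output_collect(event, sink):
    """Output stage: a list stands in here for the mongodb output."""
    sink.append(event)

sink = []
for raw in ['{"msg": "ok"}', '{"msg": "noise", "debug": true}']:
    event = filter_drop(codec_json(raw))
    if event is not None:
        output_collect(event, sink)

print(sink)  # [{'msg': 'ok'}]
```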
SUMMARY
MongoDB scales easily
99.999% uptime is achievable, plus built-in security features
Smooth infrastructure integration
Customizability of components
Reasonable IO and hardware requirements
Out-of-the-box features & tools (aggregation, map-reduce, MMS &
OpsManager)
USEFUL LINKS
1. MongoDB Multi-Datacenter Deployments
2. LogStash (events and logs manager)
3. Motor (async Python driver for Tornado and MongoDB)
4. Mongoosastic (The Power of MongoDB & Elasticsearch together)
5. 10gen Mongo-Connector
QUESTIONS
THANK YOU (: