MongoDB is a document database with a more flexible schema than relational databases. Related data can be embedded in a single document, which makes updates simpler than in a relational database accessed through object-relational mapping. MongoDB scales horizontally through sharding and provides high availability through replica sets. Write concerns and read preferences let an application choose between eventual and strong consistency.
2. Who Am I?
• Solutions Architect/Evangelist at MongoDB Inc.
• 24 years of experience in databases and software development
• Former MySQL employee
• Previous life: web, web, web
4. Understanding Big Data – It’s Not Very “Big”
from Big Data Executive Summary – 50+ top executives from Government and F500 firms
64% - Ingest diverse, new data in real time
15% - More than 100TB of data
20% - Less than 100TB (average of all? <20TB)
5. “I have not failed. I've just found 10,000 ways that won't work.”
― Thomas A. Edison
18. MongoDB Vision
To provide the best database for how we build and run apps today
Build
– New and complex data
– Flexible
– New languages
– Faster development
Run
– Big Data scalability
– Real-time
– Commodity hardware
– Cloud
19. Enterprise Big Data Stack
Applications: CRM, ERP, Collaboration, Mobile, BI
Data Management: Online Data (RDBMS), Offline Data (EDW, Hadoop)
Infrastructure: OS & Virtualization, Compute, Storage, Network
Management & Monitoring
Security & Auditing
22. Key → Value
• One-dimensional storage
• Single value is a blob
• Query on key only
• No schema
• Value cannot be updated, only replaced
Key Blob
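The key-value properties above can be illustrated with a minimal sketch. The `KeyValueStore` class and the `user:1` key are hypothetical and stand in for no particular product; the point is only that the store sees an opaque blob, so changing one field means replacing the whole value.

```python
import json


class KeyValueStore:
    """Toy key-value store: opaque values, lookups by key only (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, blob: bytes) -> None:
        # The store cannot see inside the blob, so a write replaces the whole value.
        self._data[key] = blob

    def get(self, key: str) -> bytes:
        return self._data[key]


store = KeyValueStore()
store.put("user:1", json.dumps({"name": "Ann", "city": "NY"}).encode())

# Changing one field: read, decode, modify, re-encode, then replace the blob.
user = json.loads(store.get("user:1"))
user["city"] = "SF"
store.put("user:1", json.dumps(user).encode())
```

There is no way to ask this store for "all users in SF" without reading and decoding every value, which is what "query on key only" implies.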
23. Relational/Wide Column
• Two-dimensional storage (tuples)
• Each field contains a single value
• Query on any field
• Very structured schema (table)
• In-place updates
• Normalization requires many tables, joins, and indexes, and results in poor data locality
Primary Key
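As a rough illustration of the relational points above, the sketch below uses Python's built-in `sqlite3` module. The `customers` and `orders` tables are made-up examples: related data is split across two tables, read back with a join, and updated in place one field at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    item TEXT
);
-- Normalized data needs an extra index to make the join efficient.
CREATE INDEX idx_orders_customer ON orders(customer_id);
INSERT INTO customers VALUES (1, 'Ann');
INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen');
""")

# Reading one customer's orders requires a join across both tables.
rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "WHERE c.name = ? ORDER BY o.id",
    ("Ann",),
).fetchall()

# In-place update of a single field.
conn.execute("UPDATE orders SET item = 'notebook' WHERE id = 10")
updated = conn.execute("SELECT item FROM orders WHERE id = 10").fetchone()[0]
```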
24. Document
• N-dimensional storage
• Each field can contain 0, 1, many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data locality, requires fewer indexes, and performs better
_id
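The document model can be sketched in plain Python with nested dictionaries and lists. The helpers `get_path` and `find` are hypothetical and only imitate the idea of querying any field at any level with a dotted path, similar in spirit to a MongoDB query such as `{"orders.item": "pen"}`; they are not the driver API.

```python
def get_path(doc, path):
    """Yield every value reachable via a dotted path, descending into lists."""
    head, _, rest = path.partition(".")
    if isinstance(doc, list):
        for item in doc:
            yield from get_path(item, path)
    elif isinstance(doc, dict) and head in doc:
        value = doc[head]
        if rest:
            yield from get_path(value, rest)
        elif isinstance(value, list):
            yield from value
        else:
            yield value


def find(collection, path, expected):
    """Return documents where any value at the dotted path equals expected."""
    return [d for d in collection if expected in get_path(d, path)]


# Each field holds 0, 1, many, or embedded values; the two documents differ in shape.
customers = [
    {"_id": 1, "name": "Ann", "address": {"city": "NY"},
     "orders": [{"item": "book"}, {"item": "pen"}]},
    {"_id": 2, "name": "Bob", "orders": []},
]

pen_buyers = find(customers, "orders.item", "pen")  # query on an embedded level

# Update one nested field without rewriting the rest of the document.
customers[0]["address"]["city"] = "SF"
```

Because the orders live inside the customer document, no join and no second index are needed to read them together.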
26. Document Model Benefits
• Agility and flexibility
– Data models can evolve easily
– Companies can adapt to changes quickly
• Intuitive, natural data representation
– Developers are more productive
– Many types of applications are a good fit
• Reduces the need for joins, disk seeks
– Programming is simpler
– Performance can be delivered at scale
29. Automatic Sharding
• Three types of sharding: hash-based, range-based, tag-aware
• Increase or decrease capacity as you go
• Automatic balancing
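The three sharding types listed above can be compared with a simplified sketch. The shard names, range boundaries, and tags are invented for illustration, and the hash case uses a plain modulo, which is a simplification: it is assumed here only to show how hashing spreads keys, not how MongoDB assigns hashed key ranges to chunks.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]


def hash_shard(key: str) -> str:
    """Hash-based: a hash of the key spreads neighbouring keys across shards."""
    digest = hashlib.md5(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Range-based: shard0 holds key < "h", shard1 holds "h" <= key < "p", shard2 the rest.
RANGE_BOUNDS = ["h", "p"]


def range_shard(key: str) -> str:
    """Range-based: neighbouring keys land on the same shard."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key)]


# Tag-aware: shards carry tags, and data for a tag is pinned to matching shards.
SHARD_TAGS = {"shard0": "EU", "shard1": "US", "shard2": "US"}


def eligible_shards(tag: str) -> list:
    """Tag-aware: return the shards allowed to hold data with this tag."""
    return [name for name in SHARDS if SHARD_TAGS[name] == tag]
```

Range-based placement keeps range queries on few shards, hash-based placement evens out write load, and tag-aware placement controls which machines hold which data.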
30. Query Routing
• Multiple query optimization models
• Each sharding option appropriate for different apps
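One way to picture query routing is the difference between a targeted query and a broadcast one. The sketch below assumes a range-sharded collection with a hypothetical shard key `username` (lowercase ASCII values) and handles equality matches only; the `route` function and the range table are illustrative, not the router's real logic.

```python
# (shard name, inclusive lower bound, exclusive upper bound) for the shard key.
# "{" is the character after "z" in ASCII, so it closes the last range.
SHARD_RANGES = [("shard0", "a", "h"), ("shard1", "h", "p"), ("shard2", "p", "{")]


def route(query: dict, shard_key: str = "username") -> list:
    """Return the shards that must receive the query."""
    if shard_key in query:
        # Targeted: the shard key value identifies exactly one range.
        value = query[shard_key]
        return [name for name, low, high in SHARD_RANGES if low <= value < high]
    # Broadcast: without the shard key, every shard has to be asked.
    return [name for name, _, _ in SHARD_RANGES]


targeted = route({"username": "mike"})
broadcast = route({"city": "NY"})
```

Queries that include the shard key touch one shard, which is why the choice of shard key and sharding type depends on how the application queries its data.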
31. Availability Considerations
High Availability – Ensure application availability during many types of failures
Disaster Recovery – Address the RTO and RPO goals for business continuity
Maintenance – Perform upgrades and other maintenance operations with no application downtime
32. Replica Sets
• Replica Set – two or more copies
• “Self-healing” shard
• Addresses many concerns:
– High Availability
– Disaster Recovery
– Maintenance
40. Tagging
• Control where data is written to, and read from
• Each member can have one or more tags
– tags: {dc:"ny"}
– tags: {dc:"ny", subnet:"192.168", rack:"row3rk7"}
• Replica set defines rules for write concerns
• Rules can change without changing app code
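A tag-based write rule can be sketched as "the write must be acknowledged by members covering N distinct values of a tag". The member list and the `multi_dc` rule below are hypothetical; in MongoDB such named rules live in the replica set configuration (under `settings.getLastErrorModes`), which is why they can change without touching application code. The checker function is an illustration, not server code.

```python
MEMBERS = [
    {"name": "ny1", "tags": {"dc": "ny", "subnet": "192.168", "rack": "row3rk7"}},
    {"name": "ny2", "tags": {"dc": "ny"}},
    {"name": "sf1", "tags": {"dc": "sf"}},
]

# Hypothetical rule: a write counts as safe once two different data centers have it.
multi_dc = {"dc": 2}


def write_satisfied(rule: dict, acked_members: list) -> bool:
    """rule maps a tag name to the number of distinct tag values that must acknowledge."""
    for tag, required in rule.items():
        distinct = {m["tags"][tag] for m in acked_members if tag in m["tags"]}
        if len(distinct) < required:
            return False
    return True


same_dc = write_satisfied(multi_dc, [MEMBERS[0], MEMBERS[1]])   # both in "ny"
two_dcs = write_satisfied(multi_dc, [MEMBERS[0], MEMBERS[2]])   # "ny" and "sf"
```

The application only names the rule; which members satisfy it is decided by the tags in the replica set configuration.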
43. Read Preference Modes
• 5 modes
– primary (only) - Default
– primaryPreferred
– secondary
– secondaryPreferred
– nearest
When more than one node is eligible, the closest node is used for reads (all modes but primary)
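The five modes can be modelled as ordered lists of candidate groups, with the closest member picked from the first non-empty group. This is a simplified sketch: the member list and ping times are invented, and "closest" is taken to mean the lowest measured ping, whereas real drivers apply their own latency rules.

```python
def select_member(mode: str, members: list) -> str:
    """Pick the member a read would go to under the given read preference mode."""
    up = [m for m in members if m["up"]]
    primaries = [m for m in up if m["role"] == "primary"]
    secondaries = [m for m in up if m["role"] == "secondary"]
    # Each mode is an ordered list of candidate groups; later groups are fallbacks.
    candidates_by_mode = {
        "primary": [primaries],
        "primaryPreferred": [primaries, secondaries],
        "secondary": [secondaries],
        "secondaryPreferred": [secondaries, primaries],
        "nearest": [up],
    }
    for group in candidates_by_mode[mode]:
        if group:
            # Simplification: "closest" means the lowest ping time.
            return min(group, key=lambda m: m["ping_ms"])["name"]
    raise LookupError(f"no member available for mode {mode!r}")


replica_set = [
    {"name": "a", "role": "primary", "ping_ms": 40, "up": True},
    {"name": "b", "role": "secondary", "ping_ms": 5, "up": True},
    {"name": "c", "role": "secondary", "ping_ms": 20, "up": True},
]

default_read = select_member("primary", replica_set)
nearest_read = select_member("nearest", replica_set)
```

Reads from secondaries may return slightly stale data, so modes other than primary trade consistency for latency or availability.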
44. Single Data Center
• Automated failover
• Tolerates server failures
• Tolerates rack failures
• Number of replicas defines failure tolerance
Primary – A   Primary – B   Primary – C
Secondary – A   Secondary – A   Secondary – B
Secondary – B   Secondary – C   Secondary – C
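The link between the number of replicas and failure tolerance follows from majority voting: a replica set can elect a primary only while a strict majority of its voting members is reachable. The helper names below are made up for this sketch.

```python
def majority(voting_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many members can fail while a primary can still be elected."""
    return voting_members - majority(voting_members)


for n in (3, 4, 5):
    print(f"{n} members: majority {majority(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Going from three to four members adds no tolerance, which is one reason odd member counts are the usual choice.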
45. Active/Standby Data Center
• Tolerates server and rack failure
• Standby data center
Data Center - West
Primary – A   Primary – B   Primary – C
Secondary – A   Secondary – B   Secondary – C
Data Center - East
Secondary – A   Secondary – B   Secondary – C
46. Active/Active Data Center
• Tolerates server, rack, data center failures, network partitions
Data Center - West
Primary – A   Primary – B   Primary – C
Secondary – A   Secondary – B   Secondary – C
Data Center - East
Secondary – A   Secondary – B   Secondary – C
Secondary – B   Secondary – C   Secondary – A
Data Center - Central
Arbiter – A   Arbiter – B   Arbiter – C
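Why the active/active layout survives the loss of a whole data center while active/standby does not can be checked by counting votes per shard. The sketch assumes the per-shard placement read off the two diagrams (three members for active/standby, five including an arbiter for active/active) and that every member has one vote; the function name is hypothetical.

```python
def can_elect_primary(members: list, failed_dc: str) -> bool:
    """True if the members outside the failed data center form a strict majority."""
    surviving = sum(1 for m in members if m["dc"] != failed_dc)
    return surviving >= len(members) // 2 + 1


# Active/standby, one shard: primary and secondary in West, one secondary in East.
active_standby = [
    {"role": "primary", "dc": "west"},
    {"role": "secondary", "dc": "west"},
    {"role": "secondary", "dc": "east"},
]

# Active/active, one shard: two members in West, two in East, an arbiter in Central.
active_active = [
    {"role": "primary", "dc": "west"},
    {"role": "secondary", "dc": "west"},
    {"role": "secondary", "dc": "east"},
    {"role": "secondary", "dc": "east"},
    {"role": "arbiter", "dc": "central"},
]

standby_survives_west_loss = can_elect_primary(active_standby, "west")
active_survives = [can_elect_primary(active_active, dc) for dc in ("west", "east", "central")]
```

With active/standby, losing West leaves one vote out of three, so the standby cannot elect a primary by itself. With the arbiter in a third location, any single data center can fail and three of five votes remain.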
50. High Volume Data Feeds
• Machine Generated Data
– More machine forms, sensors & data
– Variably structured
• Securities Data
– High frequency trading
– Daily closing price
• Social Media / General Public
– Multiple data sources
– Each changes their format consistently
– Student Scores, ISP logs
51. Operational Intelligence
• Ad Targeting
– Large volume of users
– Very strict latency requirements
– Sentiment Analysis
• Real time dashboards
– Expose data to millions of customers
– Reports on large volumes of data
– Reports that update in real time
• Social Media Monitoring
– Join the conversation
– Catered Games
– Customized Surveys
52. Metadata
• Product Catalogue
– Diverse product portfolio
– Complex querying and filtering
– Multi-faceted product attributes
• Data analysis
– Data mining
– Call records
– Insurance Claims
• Biometric
– Retina Scans
– Fingerprints
53. Content Management
• News Site
– Comments and user generated content
– Personalization of content and layout
• Multi-device rendering
– Generate layout on the fly
– No need to cache static pages
• Sharing
– Store large objects
– Simpler modeling of metadata