MongoDB Knowledge Sharing

Published in: Technology
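  • MongoDB originated in 2007 as a project at 10gen. The project set out to build a PaaS platform similar to Google App Engine that would manage software and hardware infrastructure automatically, so that developers could concentrate on program design. This also took away much of the developers' autonomy, and the response was lukewarm. The original PaaS platform consisted of an application server and a database. The team found that people were more interested in the database, so it focused on that part, which became today's MongoDB.
    Dwight Merriman & Kevin Ryan
    MongoDB became a rising star among big-data startups in 2013. The company, founded in 2007, recently raised US$231 million in funding, making it the first open-source startup valued at more than US$1 billion. The industry currently values the company at as much as US$1.2 billion.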
  • To support hash-based sharding, MongoDB provides a hashed index type.
    The sparse property of an index ensures that the index contains entries only for documents that have the indexed field.
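The sparse property described in the note above can be illustrated without a running server. The sketch below is plain Python rather than MongoDB code: `build_index` is a hypothetical helper, and it assumes that a non-sparse index stores documents lacking the field under a null key.

```python
from collections import defaultdict

def build_index(docs, field, sparse=False):
    """Map each value of `field` to the positions of the documents holding it."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        if field not in doc:
            if sparse:
                continue  # sparse: documents without the field get no entry
            index[None].append(pos)  # non-sparse (assumed): missing field indexed as null
        else:
            index[doc[field]].append(pos)
    return dict(index)

# Hypothetical sample documents; the second one has no "email" field.
docs = [
    {"_id": 1, "email": "a@example.com"},
    {"_id": 2},
    {"_id": 3, "email": "c@example.com"},
]

dense_index = build_index(docs, "email")
sparse_index = build_index(docs, "email", sparse=True)
```

The sparse variant ends up with one entry per document that actually has an `email`, while the dense variant also carries a null entry for the document without one.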
  • MongoDB Knowledge Sharing

    1. Mongo. Philip.Zhong / Chen.Tao / Leaf.Zhu, 2014
    2. Agenda
       • What's Mongo?
       • Mongo Advantages & Limitations
       • Mongo Case Studies
    3. What's Mongo?
       • http://www.mongodb.org/ and http://www.mongodb.com/
       • MongoDB (2007-2013): $1,200,000,000
       • Red Hat (1993-2013); the slide also shows the figures $16.75 billion and $30.0+ billion
    4. What's Mongo?
       • MongoDB (from "humongous") is an open-source document database and the leading NoSQL database. It is written in C++.
       • The most SQL-like NoSQL database.
       • Mongo is an open, schemaless, document-oriented NoSQL database with rich queries, high performance, high availability, high scalability, and high flexibility.
    5. 1. Document Data Model. Documents, BSON.
       2. Rich Query Model. Full indexing, various query types.
       3. Idiomatic Drivers. Drivers for over 17 languages.
       4. Horizontal Scalability. Easy to add capacity.
       5. High Availability. HA, journaling, automatic recovery.
       6. In-Memory Performance. Memory-mapped files, reads and writes in RAM.
       7. Flexibility. Schema-free, multi-datacenter deployments, tunable consistency, widely used across many industries.
    6. Data Model
    7. Data Model
       • Maximum BSON document size: 16 MB
       • Maximum nesting depth of a BSON document: 100 levels
       • Atomic operations at the document level
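The nesting limit on slide 7 can be checked on the application side before a document is sent to the server. The sketch below is a minimal illustration in plain Python; `nesting_depth` and `within_depth_limit` are hypothetical helpers, and the way levels are counted here (each dict or list adds one level) is an assumption that may differ from the server's own accounting.

```python
MAX_NESTING_DEPTH = 100  # limit quoted on the slide

def nesting_depth(value):
    """Return how many levels of dicts and lists are nested in `value`."""
    if isinstance(value, dict):
        children = value.values()
    elif isinstance(value, list):
        children = value
    else:
        return 0  # scalars add no nesting level
    return 1 + max((nesting_depth(child) for child in children), default=0)

def within_depth_limit(doc, limit=MAX_NESTING_DEPTH):
    """True when `doc` nests no deeper than `limit` levels."""
    return nesting_depth(doc) <= limit

# Hypothetical sample document: dict -> dict -> list gives three levels.
order = {"_id": 1, "customer": {"name": "Ann", "tags": ["vip", "eu"]}}

# Build a document that is deliberately too deep (101 levels).
too_deep = {}
for _ in range(100):
    too_deep = {"child": too_deep}
```

Here `within_depth_limit(order)` holds, while `within_depth_limit(too_deep)` does not.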
    8. Data Operation
    9. Query
    10. Query Types
        1. Key-value queries
        2. Range queries
        3. Text search: AND, OR, NOT, etc.
        4. Aggregation: count, min, max, average, etc.
        5. MapReduce
    11. Cursor
        • A query returns a cursor.
        • Iterate over the cursor to get the results.
        • A batch returns 101 results or less than 1 MB of data; batchSize or limit overrides this, but a batch never exceeds 16 MB.
    12. Create/Update/Delete
    13. Write Concern
        • Errors Ignored
        • Unacknowledged
        • Acknowledged
        • Journaled
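The four write concern levels on slide 13 can be read as combinations of the `w` and `j` options. The table below is a sketch that assumes the values used by drivers of that era (`w=-1`, `w=0`, `w=1`, and `w=1` with `j=True`); the helper functions are hypothetical and only interpret the table.

```python
# Assumed mapping from the slide's level names to write concern options.
WRITE_CONCERNS = {
    "Errors Ignored": {"w": -1},           # no acknowledgment, network errors suppressed
    "Unacknowledged": {"w": 0},            # no acknowledgment from the server
    "Acknowledged": {"w": 1},              # the server confirms it received the write
    "Journaled": {"w": 1, "j": True},      # confirmed after the write reaches the journal
}

def waits_for_server(level):
    """True when the client waits for a server acknowledgment."""
    return WRITE_CONCERNS[level]["w"] >= 1

def waits_for_journal(level):
    """True when the acknowledgment is sent only after journaling."""
    return WRITE_CONCERNS[level].get("j", False)

# Levels ordered from least to most durable, as listed on the slide.
levels = list(WRITE_CONCERNS)
```

Stronger levels trade write latency for durability, so the choice depends on how much data loss the application can tolerate.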
    14. Index
        Index types:
        1. Single-field indexes
        2. Compound indexes
        3. Array indexes
        4. Geospatial indexes
        5. Hashed indexes
        6. Text search indexes (v2.4, beta)
        Index properties:
        1. Unique indexes
        2. Sparse indexes
    15. Index
        • Each index takes at least 8 KB.
        • Indexes slow down write operations, so they are expensive for collections with a high write-to-read ratio.
        • Indexes benefit collections with a high read-to-write ratio.
        • Indexes consume disk space and memory, so track and plan them carefully.
    16. Availability
    17. RDBMS Replication
    18. Mongo Replication
        • A replica set has up to 12 mongod instances.
        • It has one primary member, which receives write requests.
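Because only the primary receives writes, a replica set has to choose a new primary when the current one becomes unavailable. The sketch below assumes that an election needs a strict majority of the set's voting members; `votes_needed` and `can_elect_primary` are hypothetical helpers, not driver or server APIs.

```python
def votes_needed(voting_members):
    """Strict majority of the voting members (assumed election rule)."""
    return voting_members // 2 + 1

def can_elect_primary(voting_members, reachable_members):
    """True when the reachable members can form a majority."""
    return reachable_members >= votes_needed(voting_members)

# A three-member set survives the loss of one member...
three_member_ok = can_elect_primary(voting_members=3, reachable_members=2)

# ...but a four-member set split two and two cannot elect a primary.
four_member_split_ok = can_elect_primary(voting_members=4, reachable_members=2)
```

Under this assumption an even number of voting members adds no extra fault tolerance, which is one reason arbiters are used to reach an odd count.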
    19. Mongo Failover
    20. Secondary Hidden
    21. Read Preference
    22. Scalability
    23. Basic Concepts
        • Shards
          – Contain fractions of global data
          – Are replica sets in production
          – Can be queried directly by clients (not recommended)
        • Config Servers
          – Exist in sets of three
          – Maintain metadata
          – Are mongod instances
        • Mongos
          – Process APP requests
          – Direct requests to shards
          – Direct results to clients
          – Exist as 1+
          – Are mongos instances
          – Cache metadata
        • Replica Set
          – A group of mongod processes
          – Includes a primary and secondaries
    24. Range Based Sharding
    25. Hash Based Sharding
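The difference between the two strategies on slides 24 and 25 shows up when keys arrive in sequence. The sketch below is a simplification in plain Python: the shard names and split points are hypothetical, SHA-256 is an arbitrary stand-in for the server's hash function, and the modulo step replaces the chunk ranges that a real cluster keeps over hashed values.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]   # hypothetical shard names
RANGE_BOUNDARIES = [1000, 2000]           # hypothetical split points

def range_shard(key):
    """Range-based: keys below 1000, from 1000 to 1999, and from 2000 up."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDARIES, key)]

def hash_shard(key):
    """Hash-based: place the key by a hash of its value (simplified)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

# Ten sequential keys, such as an auto-incrementing ID.
sequential_keys = range(1000, 1010)
range_placement = {range_shard(k) for k in sequential_keys}
hash_placement = {hash_shard(k) for k in sequential_keys}
```

With range-based placement all ten keys land on one shard, which keeps range queries local but concentrates sequential writes. Hashing is meant to spread neighboring keys across shards, at the cost of turning range queries into multi-shard reads.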
    26. Splitting and Balancing
    27. Data Store As Service
    28. Case Study
    29. Schema Design
        • Remember, "schemaless" doesn't mean you don't need to design your schema!
        • Considerations to avoid the pitfalls of MongoDB schema design:
          1. Avoid growing documents
          3. Pay attention to BSON data types
          5. Field names take up space
          6. Consider using _id for your own purposes
          7. Can you use covered indexes?
          8. Use collections and databases to your advantage
        • Test everything
        • Schema design affects performance
        • Schema design affects infrastructure: RAM > indexes + hot data = better performance
    30. MongoDB for MDS – Sharding Strategy
        • When do you need to shard?
          – Your data set approaches or exceeds the storage capacity of a single MongoDB instance.
          – The size of your system's active working set will soon exceed the capacity of your system's maximum RAM.
          – A single MongoDB instance cannot meet the demands of your write operations, and all other approaches have not reduced contention.
        • Considerations for sharding
          – There are multiple ways to model a domain problem
          – Understand the key use cases of your app
          – Balance ease of query against ease of write
          – Avoid random I/O
        • Meeting behavior and sharding considerations (from 10G)
          – Schedule meeting: ~800K meeting writes/day
          – ~20% instant meetings
          – Scalability best practice: don't scale by using replication. Scale by using local read nodes.
          – Recommendation: implement local writes to meet the JOIN-meeting use case requirements
    31. Cross-DC Latency Testing
        Local vs. remote write/read latency test
        Scenario: Create two shards, each a three-member replica set. Make sure that the primary node of one runs in the local DC (SJ), whereas the primary of the second runs in the remote DC (TX). Run a small number of writes from the local DC against the Replica1 primary, then run the same writes against the Replica2 primary. Write concern = majority. The average object size is 1,500 bytes. The ping time from the local DC (SJ) to the remote DC (TX) is 46 ms.
        Local vs. remote insert tests (YCSB test):
    32. Replication Delay Across DCs
        Replication lag between data centers
        Scenario: In the local DC (SJ), where the replication primary is running, insert 500 records at a time, up to a total of 550,000 records. Record the record count and the current timestamp at the end of every 500 insertions. This is a single-threaded operation: only one process inserts the records. In the remote DC (TX), where the third secondary is running (this node is the farthest of all the secondaries and so is not part of the initial write), keep calling db.collection.count() in a loop, and whenever the count is a multiple of 500, record the count and the current timestamp. Use the data collected on the primary and on the remote secondary to compute the replication delay.
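The last step of the scenario on slide 32, computing the delay from the two logs, can be sketched as follows. The log format (pairs of record count and timestamp in seconds), the sample values, and the `replication_delays` helper are assumptions for illustration, and the result is only meaningful if the clocks in both data centers are synchronized.

```python
def replication_delays(primary_log, secondary_log):
    """Match log entries by record count and return {count: delay_seconds}.

    Both logs are lists of (record_count, timestamp_seconds) pairs.
    """
    primary_times = dict(primary_log)
    delays = {}
    for count, seen_at in secondary_log:
        # Keep the first time the secondary reported each count.
        if count in primary_times and count not in delays:
            delays[count] = seen_at - primary_times[count]
    return delays

# Hypothetical sample logs. The secondary's polling loop never observed
# a count of exactly 1000, so that batch has no matching entry.
primary_log = [(500, 10.0), (1000, 10.5), (1500, 11.25)]
secondary_log = [(500, 10.25), (1500, 11.75)]

delays = replication_delays(primary_log, secondary_log)
average_delay = sum(delays.values()) / len(delays)
```

Counts that the polling loop skipped are simply left out, so the average is taken over the batches that both sides recorded.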
    33. MongoDB for MDS – Sharding
        Goals:
        - Write to a shard primary node in physical proximity to the application server
        - Keep the shard primary node in close proximity to the application server [monitor the primary node of the replica set and, if possible, restore the primary t
        - Reduce 'scatter/gather' on reads
        - Use smart shard keys
        Solution: Add a geo-location-based field to the schema, create a shard index based on that field, assign a tag to each shard, and assign specific shard index field ra
        E.g., say we add a 'DC' field to our collection. Assuming that the application somehow knows the data center it is running in, it can use this value for
        Associate the tag ranges with specific tagged shards.
        Inferred Technical Requirements
        1. MongoDB sharding (shard keys: region + siteId + userId, region + siteId + meetingUUID) to support 3 regions (US, EMEA, APAC)
        2. Sharding by siteId + userId or siteId + meetingUUID allows hosts from the same company (siteId) and the same region to create meetings in different shards. If we need to scale horizontally, the shard config will add another shard for the same siteId
        3. Based on the shard keys, we can support the requirements of local writes and local reads
        4. Replication requirement: replicate 600,000 meetings/day within 15 minutes between 2 nodes (remark: early benchmarking shows 11M meetings' data replicated across 3 sites within 4 minutes)
        5. Availability requirement: a primary node fails over to a secondary node in the same data center within 30 sec; a primary node fails over to a secondary in a different data center within 10 minutes
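The tag-based solution on slide 33 amounts to a lookup from the leading field of the shard key to a tagged shard. The sketch below models it in plain Python with tuples as compound shard keys (region, siteId, userId). The shard names, tag names, and range bounds are hypothetical, and sending untagged keys to any shard is an assumption about how such keys would be placed.

```python
# Tag ranges over the shard key: (inclusive lower bound, exclusive upper bound, tag).
# "~" sorts after letters and digits, so each range covers one region prefix.
TAG_RANGES = [
    (("APAC",), ("APAC~",), "apac"),
    (("EMEA",), ("EMEA~",), "emea"),
    (("US",), ("US~",), "us"),
]

# Hypothetical shards, each tagged with the region it is located in.
SHARD_TAGS = {"shard-sj": "us", "shard-lon": "emea", "shard-sg": "apac"}

def tag_for(shard_key):
    """Return the tag whose range contains `shard_key`, or None."""
    for low, high, tag in TAG_RANGES:
        if low <= shard_key < high:
            return tag
    return None

def shards_for(shard_key):
    """Return the shards allowed to hold documents with this key."""
    tag = tag_for(shard_key)
    if tag is None:
        return sorted(SHARD_TAGS)  # assumed: untagged keys may live on any shard
    return sorted(name for name, t in SHARD_TAGS.items() if t == tag)

us_meeting = ("US", "site42", "user1")
emea_meeting = ("EMEA", "site7", "user9")
```

Because the region leads the shard key, a write from a US application server is routed to the US-tagged shard, which is the local-write behavior listed under the goals.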
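    34. MongoDB Use Cases
        • BillRun billing system
          Ofer Cohen released BillRun, a next-generation open-source billing solution that uses MongoDB as its back-end store. The billing system already runs in production at Israel's fastest-growing mobile operator and processes more than 500M call detail records (CDRs) per month.
        • Visual China (视觉中国)
          Stores comments, feeds, and full-text search data.
          Problems: failover did not work because the replica set was not configured correctly; it needs at least 1 primary + 2 secondaries + n arbiters. Running out of memory caused crashes; the fix was to add memory and use the correct (non-development) driver.
        • Youku (优酷)
          Youku's online comment service has been partly migrated to MongoDB; operational data analysis and mining are handled with Hadoop/HBase.
        • Qihoo 360 (奇虎360)
          Documents > 100 million
          Problem: timeouts (data larger than memory, random reads and writes, chunk-move time)
          Solution: add memory (or even use SSDs) and save space (schema refactoring); adjust the balancer's working hours to avoid peak times
        • Mailbox
          100 million messages per day; stores email and related data in MongoDB
          https://tech.dropbox.com/2013/09/scaling-mongodb-at-mailbox/
          Lesson: write lock contention
          Solution: separate the hot collection into a standalone cluster; sharding
        • Other
          Baidu Open Cloud cloud database: its non-relational database uses MongoDB, and many small and medium-sized developers build on MongoDB
          Amazon E2: MongoDB as the back-end database, if the application on it data
    35. Q&A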
