This document provides an overview of MongoDB sharding. It explains how MongoDB addresses horizontal scalability when data volume and throughput exceed the capabilities of a single machine: MongoDB uses sharding to partition data across multiple machines, or shards. The key points are:
- MongoDB shards or partitions data by a shard key, distributing data ranges across shards for scalability.
- A configuration server stores metadata about sharding setup and chunk distribution. Mongos instances route queries to appropriate shards.
- MongoDB automatically splits and migrates chunks as data grows to balance load across shards.
- Setting up sharding in MongoDB requires minimal configuration and presents the same interface to the application as a single, unsharded database.
4. Examining Growth
• User Growth
– 1995: 0.4% of the world's population
– Today: 30% of the world is online (~2.2B)
– Emerging Markets & Mobile
• Data Set Growth
– Facebook's data set is around 100 petabytes
– 4 billion photos taken in the last year (4x a decade ago)
10. Data Store Scalability Today
• MongoDB Auto-Sharding
• A data store that is
– Free
– Publicly available
– Open source (https://github.com/mongodb/mongo)
– Horizontally scalable
– Application independent
12. Partitioning
• User defines shard key
• Shard key defines range of data
• Key space is like points on a line
• Range is a segment of that line
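As a rough illustration (the namespace and values are hypothetical, and the output shape follows what the config database stored in MongoDB versions of this deck's era), each chunk is one contiguous segment of the shard-key line, bounded by min and max values:

  // from a mongos, inspect the chunk ranges for a sharded collection
  use config
  db.chunks.find({ ns: "test.people" }, { _id: 0, min: 1, max: 1, shard: 1 })
  // illustrative result:
  // { "min" : { "country" : { "$minKey" : 1 } }, "max" : { "country" : "Germany" }, "shard" : "shard0000" }
  // { "min" : { "country" : "Germany" }, "max" : { "country" : { "$maxKey" : 1 } }, "shard" : "shard0001" }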
13. Data Distribution
• Initially 1 chunk
• Default max chunk size: 64 MB
• MongoDB automatically splits & migrates chunks when the max size is reached
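The 64 MB default can be tuned by writing to the settings collection of the config database; a minimal sketch using the legacy mongo shell (32 MB is just an illustrative value, expressed in megabytes):

  // connect to a mongos, then:
  use config
  db.settings.save({ _id: "chunksize", value: 32 })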
14. Routing and Balancing
• Queries routed to specific shards
• MongoDB balances cluster
• MongoDB migrates data to new nodes
15. MongoDB Auto-Sharding
• Minimal effort required
– Same interface as single mongod
• Two steps
– Enable Sharding for a database
– Shard collection within database
22. Example Cluster
• Don’t use this setup in production!
- Only one Config server (No Fault Tolerance)
- Shard not in a replica set (Low Availability)
- Only one mongos and shard (No Performance Improvement)
- Useful for development or demonstrating configuration mechanics
23. Starting the Configuration Server
• mongod --configsvr
• Starts a configuration server on the default port (27019)
24. Start the mongos Router
• mongos --configdb <hostname>:27019
• For 3 configuration servers:
mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>
• This is always how to start a new mongos, even if the cluster is already running
25. Start the shard database
• mongod --shardsvr
• Starts a mongod with the default shard port (27018)
• Shard is not yet connected to the rest of the cluster
• Shard may have already been running in production
26. Add the Shard
• On mongos:
- sh.addShard("<host>:27018")
• Adding a replica set:
- sh.addShard("<rsname>/<seedlist>")
27. Verify that the shard was added
• db.runCommand({ listshards:1 })
{ "shards" :
[{ "_id" : "shard0000", "host" : "<hostname>:27018" } ],
"ok" : 1
}
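sh.status() prints the same information in a more readable form, together with the sharded databases and chunk distribution; a quick check from the mongos shell:

  // human-readable summary of shards, sharded databases and chunk ranges
  sh.status()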
28. Enabling Sharding
• Enable sharding on a database
sh.enableSharding("<dbname>")
• Shard a collection with the given key
sh.shardCollection("<dbname>.people", { "country": 1 })
• Use a compound shard key to prevent duplicates
sh.shardCollection("<dbname>.cars", { "year": 1, "uniqueid": 1 })
32. Chunk splitting
• A chunk is split once it exceeds the maximum size
• There is no split point if all documents have the same shard key
• Chunk split is a logical operation (no data is moved)
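Splits normally happen automatically, but the shell also exposes helpers to request a split by hand; a sketch with a hypothetical namespace and key values:

  // split the chunk containing { country: "Germany" } at its median document
  sh.splitFind("test.people", { country: "Germany" })
  // split so that { country: "Italy" } becomes the lower bound of a new chunk
  sh.splitAt("test.people", { country: "Italy" })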
33. Balancing
• Balancer is running on mongos
• Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts
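The balancer can be inspected and toggled from any mongos; a minimal sketch:

  sh.getBalancerState()   // true if the balancer is enabled
  sh.isBalancerRunning()  // true if a balancing round is in progress right now
  sh.stopBalancer()       // disable balancing, e.g. for a maintenance window
  sh.startBalancer()      // re-enable it afterwards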
34. Acquiring the Balancer Lock
• The balancer on mongos takes out a “balancer lock”
• To see the status of these locks:
use config
db.locks.find({ _id: "balancer" })
35. Moving the chunk
• The mongos sends a moveChunk command to source shard
• The source shard then notifies destination shard
• Destination shard starts pulling documents from source shard
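Migrations are normally started by the balancer, but a chunk can also be moved explicitly; a sketch with a hypothetical namespace, key value, and shard name:

  // move the chunk containing { country: "Italy" } to the shard named "shard0001"
  sh.moveChunk("test.people", { country: "Italy" }, "shard0001")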
36. Committing Migration
• When complete, destination shard updates config server
- Provides new locations of the chunks
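The migration history is recorded in the config database's changelog collection; a sketch for verifying recent moves:

  use config
  db.changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(5)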
37. Cleanup
• Source shard deletes moved data
- Must wait for open cursors to either close or time out
- NoTimeout cursors may prevent the release of the lock
• The mongos releases the balancer lock after old chunks are deleted
58. Shard Key
• Shard key is immutable
• Shard key values are immutable
• Shard key must be indexed
• Shard key limited to 512 bytes in size
• Shard key used to route queries
– Choose a field commonly used in queries
• Only shard key can be unique across shards
– `_id` field is only unique within individual shard
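Whether a query can be routed to a single shard shows up in its explain output; a sketch with a hypothetical collection sharded on { country: 1 }:

  // targeted: the filter contains the shard key, so mongos routes it to one shard
  db.people.find({ country: "Italy" }).explain()
  // scatter-gather: no shard key in the filter, so mongos must ask every shard
  db.people.find({ age: { $gt: 30 } }).explain()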
59. Shard Key Considerations
• Cardinality
• Write Distribution
• Query Isolation
• Reliability
• Index Locality
Ops will be most interested in Configuration. Dev will be most interested in Mechanics.
Take this slide and toss it out the window for the combined sharding talk. It’s a nice intro, but it’s more effective to spend time talking about tag aware and shard keys if you’re trying to go beyond the intro talk.
Indexes should be contained in the working set.
From mainframes to Oracle RAC servers, people solved scaling problems by adding more resources to a single machine.
Google was the first to demonstrate that a large scale operation could be had with high performance on commodity hardware.
Build - Document oriented database maps perfectly to object oriented languages.
Scale - MongoDB presents a clear path to scalability that isn't ops intensive - provides the same interface for a sharded cluster as for a single instance.
In 2005, there were two ways to achieve datastore scalability: spend lots of money to purchase custom hardware, or spend lots of money to develop custom software.
MongoDB 2.2 and later only need <host> and <port> for one member of the replica set
This can be skipped for the intro talk, but might be good to include if you’re doing the combined sharding talk. Totally optional, you don’t really have enough time to do this topic justice but it might be worth a mention.
Quick review from earlier.
Once the chunk size is reached, mongos asks mongod to split the chunk (via an internal function called splitVector()).
mongod counts the number of documents on each side of the split, based on the average document size from `db.stats()`.
Chunk split is a **logical** operation (no data has moved).
Max on first chunk should be 14.
Balancer lock actually held on config server.
Moved chunk on shard2 should be gray
How do the other mongoses know that their configuration is out of date? When does the chunk version on the shard itself get updated?
The mongos does not have to load the whole set into memory since each shard sorts locally. The mongos can just getMore from the shards as needed and incrementally return the results to the client.
_id could be unique across shards if used as the shard key. We can only guarantee uniqueness of (any) attribute if it is used as the shard key with the unique attribute set to true.
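A sketch of that unique option (namespace hypothetical): the third argument to sh.shardCollection asks MongoDB to enforce a unique index on the shard key, which only works if the collection is empty or already has such an index:

  // shard on { email: 1 } and enforce uniqueness of the shard key across shards
  sh.shardCollection("test.users", { email: 1 }, true)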
Cardinality – can your data be broken down enough?
Query Isolation – query targeting to a specific shard.
Reliability – shard outages.
A good shard key can: optimize routing, minimize (unnecessary) traffic, and allow the best scaling.
This section can be skipped for the intro talk, but might be good to include if you’re doing the combined sharding talk.
Reliability = if we lost an entire shard, what would be the effect on the system?
For the most frequently used/largest indexes, how are they partitioned across shards?
Compound shard key: utilize an already existing (used) index, and partition chunks for "power" users. For those users, more than one shard *may* need to be queried even in a targeted query.