1. Percona Live 2016
Kimberly Wilkins
Updated Sharding Guidelines in MongoDB 3.x
and Storage Engine Considerations
Principal Engineer - Databases
Rackspace/ObjectRocket
www.linkedin.com/in/wilkinskimberly, @dba_denizen,
kimberly.wilkins@rackspace.com
2. My Background
• 18+ Years working on various database platforms
• Mainly Oracle (DBA work, RAC, Enterprise Manager, GoldenGate Replication,
DataGuard, Database Vault, Exadata)
• MongoDB, NoSQL, and Big Data infrastructure and technologies at ObjectRocket
• Industries – early online auto auctions, gaming, social media
• Specialties – re-architecting enterprise db environments, infrastructure,
implementations, RAC, replication, system kernels, database storage
• Re-engineered the database infrastructure for SWTOR – Star Wars The
Old Republic MMO game
3. Overview - Sharding
• What is Sharding?
• Why Shard?
• When to Shard? When not to Shard?
• Sharding Process
• Selecting <Good> Shard Keys
• Specific Tips and Examples, Managing Shards and Scaling
• Radical Ideas (?) and Storage Engine Considerations
4. Sharding – What is it?
• Sharding = Horizontal Scaling, Partitioning
• Scale Out – add physical or virtual hosts
• Add supporting network and app layers
• Redundancy; flexible, scalable architectures; resources added on the fly; HA and DR; fault-tolerant clusters; many different sources and types
5. Commodity Hardware vs. Big Iron
• Multiple smaller hosts or Virtuals/Containerization
• Larger Single Servers with Massive CPU, RAM, SANs
6.
7. Out (Horizontal) vs Up (Vertical)
• Multiple smaller hosts or Virtuals/Containerization
• Larger Single Servers with Massive CPU, RAM, SANs
• Out, NOT Up – more smaller hosts, not fewer BIGGER ones
8. Why Would You Need or Want to Shard?
• Scalability, Performance, High Availability, Redundancy
9. High Availability Matters; Redundancy Matters.
Without a way to post, creep and
comment on things within the
world's most popular social
network, many turned to Twitter
with their updates. PANIC!!
“#facebookdown day 1, minute 3: we still have
electricity, poppa has been hoarding antibiotics
and microbrews. We're gonna ride this out.”
There are people on the streets hurling printed-out pics of their kids at strangers bellowing "Like them. Like them." #facebookdown
13. Why to Shard? … Why NoSQL?
• Faster, more flexible development – 24%
• Lower software, hardware, and deployment $$ - 21%
• Performance - faster writes, faster reads
• Developers – “schemaless”, cool toys
• Developer headcount growing faster than DBA headcount (^^ devs vs. ^ DBAs)
• Variety of NoSQL Technologies
14. RDBMS -> NoSQL
records -> documents
tables -> collections, buckets, tables
rows -> fields
set data types -> flexible data types
rigid schemas, structured data -> unstructured & structured data
primary keys -> document or objectId's
normalized -> de-normalized
referential integrity -> duplicated data is OK
joins -> index intersections, partials
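To make the mapping above concrete, here is a small illustrative insert (field names are invented for this example, not taken from the deck): one de-normalized document carries data a relational model would spread across joined tables.
db.users.insert({
    name: "Ada",
    addresses: [ { type: "home", city: "Austin" } ],   // embedded array instead of a joined address table
    preferences: { theme: "dark" },                     // nested sub-document
    loyaltyTier: "gold"                                 // fields can vary from document to document
});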
15. When To Shard?
• Need Better Performance
• Need Additional Write Scopes
• App development today => Think ahead, expect growth
• EARLY – Shard BEFORE You Run Out of Resources
• Have Different Use cases
• Best Tool for the Job - aka Polyglot Persistence
18. How to Shard?
• Architectural Overview
• General Process and Steps for Sharding
• Shard Key Selection
• Details, Examples, and Tips
• Managing Shards and Replication
• Radical ideas and Storage Engine Considerations
21. Shard Key Considerations
• Good key, good performance. Bad key, bad performance.
• NO PERFECT shard key – trade-offs, especially for users/social apps
• Shard key – in all docs – immutable **
• Shard key – used in queries; know your query patterns
• Easily divisible – for balanced chunks, increase cardinality
• Consider compound keys to better limit the return set (see the sketch below)
• Shard early, shard often – impactful, so don't wait
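As a hedged illustration of the compound-key bullet above (the collection and field names here are hypothetical, not from the deck), a coarse routing field combined with a more selective field lets common queries target a subset of shards:
use orders; db.orders.ensureIndex( { customerId: 1, createdAt: 1 }, { background: true } );
use admin; db.runCommand( { shardCollection: "orders.orders", key: { customerId: 1, createdAt: 1 } } );
Queries that include customerId (and ideally createdAt) can then be routed to the shards holding that range instead of scatter-gathering across the cluster.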
22. • What Does your App do? How does it work?
• More read heavy? More write heavy? Balanced 50/50?
• 1 activity more important than others - ex. we write a lot but
we make our $ by people querying
• Expected growth patterns - per week? per month? per year?
• Busy times of day? week? month? year?
• Bulk Loads/Deletes? ever? when?
• Current pain or performance problem areas?
ASK <Additional> QUESTIONS!!!
23. How to Shard – General Steps
• Run the profiler on ALL nodes for a REPRESENTATIVE time period (see the profiler sketch below)
• Review the type of queries and the # per namespace
• Find patterns and predicates via aggregations
• Check for nulls – NO nulls allowed in shard keys
• Consider compound keys – limit the return set
• Check cardinality – on secondaries – less hurtful!!
• NO PERFECT shard key – trade-offs, especially for users/social apps
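A minimal sketch of the profiler step above (the level and the 20-row limit are assumptions; run it per database on nodes that can absorb the overhead): capture operations for a representative window, then summarize what system.profile recorded by namespace and operation type.
db.setProfilingLevel(2);        // 2 = profile ALL operations (1 = slow ops only, 0 = off)
// ... let a representative workload run, then turn profiling back off:
db.setProfilingLevel(0);
// Operations per namespace and op type, most frequent first:
db.system.profile.aggregate([
    { $group: { _id: { ns: "$ns", op: "$op" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: 20 }
]);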
24. How to Shard – Specific Tasks
• Perform Profiling and Query Pattern Analysis
• Select the BEST option for the Shard Key
• Create the Required Shard Key Index
• Disable the Balancer
• Enable Sharding at The DB level
• Shard the Collection / Add Shards
• Re-enable the Balancer
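The balancer and db-level steps above expressed as standard mongos shell helpers, as a rough sketch (the "users" database matches the example on slide 28, where the index and shardCollection commands are shown):
sh.stopBalancer();               // disable the balancer before sharding the collection
sh.getBalancerState();           // confirm it is off (returns false)
sh.enableSharding("users");      // enable sharding at the db level
// ... create the shard key index and shard the collection (slide 28) ...
sh.startBalancer();              // re-enable the balancer once sharding is in place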
25. Sample Shard Key Evaluation Queries/Aggregations
• **Run Queries and Aggregations against unused Secondaries**
SECONDARY> db.events.new.aggregate([{$project:{"BoxId":1}},{ $group: { _id: "$BoxId"} },{ $group: { _id: 1, count: { $sum: 1 } } }],{allowDiskUse:true})
{ "_id" : 1, "count" : 3303464 }  ** Note good cardinality here **
• SECONDARY> db.events.new.aggregate([{$project:{"BoxId":1}},{$group: { _id:"$BoxId",number : {$sum: 1}}},{$sort:{number:-1}},{$limit:20}],{allowDiskUse:true})
• { "_id" : "pnx-xxxxxxxx.003", "number" : 46889 }
• { "_id" : "jhx-xxxxxxxx.002", "number" : 23644 }
• { "_id" : "3tq9-xxxxxxxx.001", "number" : 17769 }
- Look for NULLS and for DISTINCT values (see the checks below)
- Look at sample values of documents and fields near the FRONT and BACK of the collection
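A short follow-up sketch for the NULL and DISTINCT checks above, reusing the example collection and candidate field (BoxId):
SECONDARY> db.events.new.find({ BoxId: null }).count()                 // docs with a null candidate-key value
SECONDARY> db.events.new.find({ BoxId: { $exists: false } }).count()   // docs missing the field entirely
SECONDARY> db.events.new.distinct("BoxId").length                      // distinct values (fine while the result fits in a single 16MB response)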
26. Another Shard Key Aggregation example
• Run aggregation query to find the most common reference id (rid) values
and sort to give you the top 5.
• Run against the PROFILE collection and can run on SECONDARIES of busy
systems to prevent impact to your application that works via primaries.
SECONDARY> db.items.aggregate([{$project:{rid:1}}, {$group:{_id:"$rid",count:{$sum:1}}}, {$sort:{count:-1}}, {$limit:5}])
28. Actual Sample Sharding Commands
• Shard Key Selection Analysis and Considerations
• Create required index:
use users; db.users.ensureIndex( { "_id" : "hashed" }, { background: true } );
• Enable sharding at the db level:
use admin; db.runCommand( { enablesharding: "users" } );
• Shard the collection:
db.adminCommand( { shardCollection : "users.users", key : { "_id" : "hashed" } } );
29. Pre-Sharding for Very Active, Larger Collections
• Connect to a 'non-real' mongo shell
• Use javascript to create the javascript for the desired goals
• Start a screen or tmux session and name it
• Connect via MongoS to your real desired instance as the admin db
• Use the generated scripts/commands to enable sharding at
the db level, then create the collections with the desired # of pre-allocated
initial chunks (see the generator sketch below)
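A hedged sketch of that "javascript that creates javascript" step (the weekly hits-2016-NN names follow the sh.status() output on the next slide; 2000 initial chunks, 500 per shard across 4 shards, is an assumption for illustration). Run it in a scratch shell, then paste the printed commands into the real mongos session:
var weeks = [15, 16];                                   // week numbers to pre-create
weeks.forEach(function (w) {
    var dbName = "hits-2016-" + w;
    var ns = dbName + "." + dbName;                     // db and collection share the name in this example
    print('sh.enableSharding("' + dbName + '");');
    print('db.adminCommand({ shardCollection: "' + ns +
          '", key: { _id: "hashed" }, numInitialChunks: 2000 });');
});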
32. Confirm pre-created chunks and Balance
mongos> sh.status()
{ "_id" : ”hits-2016-15", "partitioned" : true, "primary" : ”shardkw1" }
hits-2016-15.hits-2016-15
shard key: { "_id" : "hashed" }
chunks:
shardkw1 1030
shardkw2 638 <<removed 2 lines >> bit see still growing there with natural splits>>
{ "_id" : ”hits-2016-16", "partitioned" : true, "primary" : "shardkw1" }
hits-2016-16.hits-2016-16 <<Just checking that correct weeks were created>>
shard key: { "_id" : "hashed" }
chunks:
shardkw1 500
shardkw2 500
shardkw3 500
shardkw4 500
too many chunks to print, use verbose if you want to force print
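Besides sh.status(), chunk counts per shard can be pulled straight from the config database; a small sketch using the namespace from the output above:
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "hits-2016-16.hits-2016-16" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
]);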
33. 3 Very Different Use Cases for Sharding
• 1st case - Large # of small-sized shards
• MANY smaller shards, as they need additional write scopes
• 2nd case - Medium # of medium-sized shards
• Larger, but still need write scopes without users spread so far across all of the shards when reading
• 3rd case - Smaller # of larger-sized shards
• Need additional resources for a higher number of connections and a higher number of queries
• IN ALL 3 cases – they are sharded on the write-friendly "_id" : "hashed"
34. BY – Large # of Small Shards | DR – Medium # of Medium Shards | BS – Small # of Large Shards
App: mobile analytics and marketing app | social media app holding connective user data | mobile game marketing and monetization customer
Shard key: "_id" : "hashed" | "_id" : "hashed" | "_id" : "hashed"
Data: 256 million smaller user docs of ~2143 bytes, smaller user updates and campaigns | ~82 million bigger user docs of ~26036 bytes | ~10 billion smaller device docs of ~252 bytes (lots and lots of devices - mobile phones)
Shards: 45 shards @ 20G plan size | 22 shards @ 100G plan size | 7 shards @ 500G plan size
Queries per second: 100 – 160 | 100 – 125 | 400 – 2000 (*have seen up to 300,000 QPS)
Updates per second: 20 – 40 | 85 – 110 | 20 – 40
Inserts per second: 10 – 20
Connections: ~1200 per shard * 45 shards, so ~54,000 | ~4000 per shard * 22 shards, so ~88,000 | ~5700 per shard * 7 shards, so ~40,000
Why: need more, smaller shards for the much larger number of write scopes | need more write scopes but not the associated spread-out scatter-gathers, so not as many shards | need the additional resources of larger shards due to the higher number of queries and connections and the smaller size of objects
Balance: well balanced chunks and disks | well balanced chunks and disks AFTER initially taking a bit to get balanced | not balanced naturally – must manipulate via numInitialChunks at the new db and sharded collection creation point
36. Bad Shard Keys…. Bad Performance
• Hot Spotting for Writes
• Hot Spotting for Reads
• Disk Imbalance
• Jumbo Chunks
• Slow Queries
• Slow Performance
• Slow Apps
• Angry Customers
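Two of the symptoms above, jumbo chunks and chunk/disk imbalance, can be spotted quickly from a mongos; a rough sketch:
sh.status(true);                                          // verbose status flags jumbo chunks per namespace
db.getSiblingDB("config").chunks.find({ jumbo: true });   // chunks the balancer has marked as jumbo
db.getSiblingDB("config").chunks.aggregate([              // chunk counts per shard, to spot imbalance
    { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]);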
37. Bad Shard Key… What to Do?
Fix It !!!
- dump & restore
- drains
38. Bad Shard Key… Fixing
• Dump and Restore
– Dump collection; drop collection; recreate collection
– Re-shard collection, restore collection
• Drain Shard
– Estimate moveChunk time
db.getSiblingDB("config").changelog.find({"what" : "moveChunk.commit"},{time:1,_id:0}).sort({time:-1})
– Run a js script to generate the moveChunk commands (see the sketch below)
– Stop the balancer, then run the moveChunk script
– Run the removeShard command twice, then restart the balancer
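A minimal sketch of a script that generates the moveChunk commands for a drain (the shard names follow the shardkw* example from slide 32; the round-robin target choice is an assumption): it prints commands to review and then run against a mongos.
var from = "shardkw4";                                    // shard being drained
var targets = ["shardkw1", "shardkw2", "shardkw3"];       // shards that will absorb its chunks
var i = 0;
db.getSiblingDB("config").chunks.find({ shard: from }).forEach(function (c) {
    var to = targets[i++ % targets.length];               // spread the chunks round-robin
    print('db.adminCommand({ moveChunk: "' + c.ns +
          '", bounds: [' + tojson(c.min) + ', ' + tojson(c.max) +
          '], to: "' + to + '" });');
});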
40. Larger Replica Set vs. Sharding
Replica Set | Sharding
Want simplification | Expertise for sharding
Lots of reads – don't want scatter gathers | Lots of writes/updates – want to go directly to exact shards
Lots of data, lower activity | Lots of data, lots of activity
Need more 'normal' resources – just disks, just memory, etc. | Need more of all resources – disks, RAM, CPU, write scopes
Application knowledge | Application knowledge
Religious war - do not engage | Religious war - do not engage
42. WiredTiger vs. MMAPv1 – Generalizations ONLY
WiredTiger | MMAPv1
Frequent writes, inserts, appends | Still better for heavy read loads
Compression; defragmentation | No compression, fragmentation
Intent-level locking (document) | Collection-level locking
Mass bulk loads, small docs | Updates in place, esp. docs that grow
Complete write and replace | Updates existing, grow and move
Cache eviction settings and issues, cache settings, threads | Will use all memory allocated – memory-mapped files
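Finally, a small hedged sketch for checking which storage engine a 3.x node is actually running before applying the generalizations above (the wiredTiger fields only exist on WiredTiger nodes):
db.serverStatus().storageEngine.name                               // "wiredTiger" or "mmapv1"
db.serverStatus().wiredTiger.cache["maximum bytes configured"]     // configured WiredTiger cache size
db.serverStatus().wiredTiger.cache["bytes currently in the cache"] // current cache usage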