Find out which is faster, SQL or NoSQL, for traditional reporting tasks. Discover how you can optimise MongoDB aggregation pipelines and how to push complex computation down to the database.
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing (MongoDB)
Query performance can either be a constant headache or the unsung hero of an application. MongoDB provides extremely powerful querying capabilities when used properly. I will share the most common mistakes observed and some tips and tricks for avoiding them.
This document discusses tuning MongoDB performance. It covers tuning queries using the database profiler and explain commands to analyze slow queries. It also covers tuning system configurations like Linux settings, disk I/O, and memory to optimize MongoDB performance. Topics include setting ulimits, IO scheduler, filesystem options, and more. References to MongoDB and Linux tuning documentation are also provided.
This document discusses MongoDB performance tuning. It emphasizes that performance tuning is an obsession that requires planning schema design, statement tuning, and instance tuning in that order. It provides examples of using the MongoDB profiler and explain functions to analyze statements and identify tuning opportunities like non-covered indexes, unnecessary document scans, and low data locality. Instance tuning focuses on optimizing writes through fast update operations and secondary index usage, and optimizing reads by ensuring statements are tuned and data is sharded appropriately. Overall performance depends on properly tuning both reads and writes.
Slide deck presented at http://devternity.com/ on MongoDB internals. We review the usage patterns of MongoDB, the different storage engines and persistence models, as well as the definition of documents and general data structures.
MongoDB .local Toronto 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi... (MongoDB)
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
Inside MongoDB: the Internals of an Open-Source Database (Mike Dirolf)
The document discusses MongoDB, including how it stores and indexes data, handles queries and replication, and supports sharding and geospatial indexing. Key points covered include how MongoDB stores data in BSON format across data files that grow in size, uses memory-mapped files for data access, supports indexing with B-trees, and replicates operations through an oplog.
MongoDB World 2019: Tips and Tricks++ for Querying and Indexing MongoDB (MongoDB)
Query performance can either be a constant headache or the unsung hero of an application. MongoDB provides extremely powerful querying capabilities when used properly. As a senior member of the support team, I will share the most common mistakes observed and some tips and tricks for avoiding them.
Presented by Tom Schreiber, Senior Consulting Engineer, MongoDB
MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application? In this talk we’ll cover how indexing works, the various indexing options, and cover use cases where each might be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale. We'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
Robert Haas
Why does my query need a plan? Sequential scan vs. index scan. Join strategies. Join reordering. Joins you can't reorder. Join removal. Aggregates and DISTINCT. Using EXPLAIN. Row count and cost estimation. Things the query planner doesn't understand. Other ways the planner can fail. Parameters you can tune. Things that are nearly always slow. Redesigning your schema. Upcoming features and future work.
Indexing in MongoDB works similarly to indexing in relational databases. An index is a data structure that can make certain queries more efficient by maintaining a sorted order of documents. Indexes are created using the ensureIndex() method and take up additional space and slow down writes. The explain() method is used to determine whether a query is using an index.
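The core idea in that summary, that a sorted structure turns a full scan into a binary search, can be shown with a toy sketch. This is a minimal illustration in plain Python, not MongoDB's actual B-tree implementation; the `score` field and document shape are invented for the example.

```python
import bisect
import random

# Toy illustration (not MongoDB internals): an "index" is a sorted list of
# (key, position) pairs, so a lookup is a binary search instead of a scan.
random.seed(0)
docs = [{"_id": i, "score": random.randrange(1000)} for i in range(10_000)]

# Build the "index" on the score field: sorted keys plus parallel positions.
index = sorted((d["score"], pos) for pos, d in enumerate(docs))
keys = [k for k, _ in index]

def find_by_score(target):
    """Return positions of all docs with the given score via binary search."""
    lo = bisect.bisect_left(keys, target)
    hi = bisect.bisect_right(keys, target)
    return [index[i][1] for i in range(lo, hi)]

# A collection scan touches every document; the index touches O(log n) keys.
scan_result = [pos for pos, d in enumerate(docs) if d["score"] == 42]
assert sorted(find_by_score(42)) == scan_result
```

This also shows the trade-off the summary mentions: the index costs extra space (the sorted list) and every insert must update it, which is why indexes slow down writes.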
The document provides an overview of Hive architecture and workflow. It discusses how Hive converts HiveQL queries to MapReduce jobs through its compiler. The compiler includes components like the parser, semantic analyzer, logical and physical plan generators, and logical and physical optimizers. It analyzes sample HiveQL queries and shows the transformations done at each compiler stage to generate logical and physical execution plans consisting of operators and tasks.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
For more information, visit our website at http://casertaconcepts.com/ or email us at info@casertaconcepts.com.
MongoDB World 2019: The Sights (and Smells) of a Bad Query (MongoDB)
“Why is MongoDB so slow?” you may ask yourself on occasion. You’ve created indexes, you’ve learned how to use the aggregation pipeline. What the heck? Could it be your queries? This talk will outline what tools are at your disposal (both in MongoDB Atlas and in MongoDB server) to identify inefficient queries.
This presentation covers the differences between Elasticsearch and relational databases. It also includes a glossary of Elasticsearch terms and its basic operations.
This presentation will demonstrate how you can use the aggregation pipeline with MongoDB similar to how you would use GROUP BY in SQL and the new stage operators coming 3.4. MongoDB’s Aggregation Framework has many operators that give you the ability to get more value out of your data, discover usage patterns within your data, or use the Aggregation Framework to power your application. Considerations regarding version, indexing, operators, and saving the output will be reviewed.
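The GROUP BY analogy above can be made concrete with a small sketch. The following is a plain-Python stand-in for what a `$group` stage computes; the collection, field names, and numbers are invented for illustration, not taken from the talk.

```python
from collections import defaultdict

# Hypothetical orders; field names are illustrative, not from the talk.
orders = [
    {"city": "Toronto", "total": 120.0},
    {"city": "Toronto", "total": 80.0},
    {"city": "Montreal", "total": 50.0},
]

# The MongoDB pipeline this mimics, shown as a plain data structure:
# [{"$group": {"_id": "$city",
#              "count": {"$sum": 1},
#              "avgTotal": {"$avg": "$total"}}}]
# SQL equivalent: SELECT city, COUNT(*), AVG(total) FROM orders GROUP BY city;

groups = defaultdict(list)
for o in orders:
    groups[o["city"]].append(o["total"])

result = {
    city: {"count": len(totals), "avgTotal": sum(totals) / len(totals)}
    for city, totals in groups.items()
}

assert result["Toronto"] == {"count": 2, "avgTotal": 100.0}
assert result["Montreal"] == {"count": 1, "avgTotal": 50.0}
```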
PostgreSQL comes built-in with a variety of indexes, some of which are further extensible to build powerful new indexing schemes. But what are all these index types? What are some of the special features of these indexes? What are the size & performance tradeoffs? How do I know which ones are appropriate for my application?
Fortunately, this talk aims to answer all of these questions as we explore the whole family of PostgreSQL indexes: B-tree, expression, GiST (of all flavors), GIN and how they are used in theory and practice.
[pgday.Seoul 2022] PostgreSQL with Google Cloud (PgDay.Seoul)
Google Cloud offers several fully managed database services for PostgreSQL workloads, including Cloud SQL and AlloyDB.
Cloud SQL provides a fully managed relational database service for PostgreSQL, MySQL, and SQL Server. It offers 99.999% availability, unlimited scaling, and automatic failure recovery.
AlloyDB is a new database engine compatible with PostgreSQL that provides up to 4x faster transactions and 100x faster analytics queries than standard PostgreSQL. It features independent scaling of storage and computing resources.
Google Cloud aims to be the best home for PostgreSQL workloads by providing compatibility with open source PostgreSQL and enterprise-grade features, performance, reliability, and support across its database services.
Parquet performance tuning: the missing guide (Ryan Blue)
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
MongoDB sharded cluster. How to design your topology? (Mydbops)
These slides were presented at Mydbops Database Meetup 4 on Aug 3, 2019 by Vinodh Krishnaswamy (Percona). This talk focuses on when to adopt a sharded topology in MongoDB, along with its benefits and impact.
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15) (MongoDB)
The document discusses different data modeling approaches for structuring data in MongoDB, including embedding data versus referencing data in collections. It provides examples of modeling one-to-one, one-to-many, and many-to-many relationships between entities using embedding and referencing. The document recommends different approaches depending on the use case and prioritizes flexibility, performance, and optimal data representation.
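The embedding-versus-referencing choice described above is easiest to see as document shapes. The following is an illustrative sketch in Python; the blog post/comments example and all field names are assumptions, not taken from the slides.

```python
# Two ways to model a one-to-many post/comments relationship.

# 1. Embedding: comments live inside the post document. One read fetches
#    everything, but a very active post can grow without bound.
post_embedded = {
    "_id": "post1",
    "title": "Schema design",
    "comments": [
        {"author": "ada", "text": "Nice talk"},
        {"author": "alan", "text": "+1"},
    ],
}

# 2. Referencing: comments are separate documents pointing back at the
#    post. Reading them takes a second query (or a $lookup), but each
#    document stays small and comments can be paginated independently.
post_ref = {"_id": "post1", "title": "Schema design"}
comments = [
    {"_id": "c1", "post_id": "post1", "author": "ada", "text": "Nice talk"},
    {"_id": "c2", "post_id": "post1", "author": "alan", "text": "+1"},
]

# The "join" for the referenced form, done client-side:
joined = [c for c in comments if c["post_id"] == post_ref["_id"]]
assert [c["author"] for c in joined] == ["ada", "alan"]
```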
Tuning Apache Spark for Large-Scale Workloads, Gaoxiang Liu and Sital Kedia (Databricks)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
The document provides an introduction to the ELK stack, which is a collection of three open source products: Elasticsearch, Logstash, and Kibana. It describes each component, including that Elasticsearch is a search and analytics engine, Logstash is used to collect, parse, and store logs, and Kibana is used to visualize data with charts and graphs. It also provides examples of how each component works together in processing and analyzing log data.
MongoDB World 2019: RDBMS Versus MongoDB Aggregation Performance (MongoDB)
Join me as we compare the performance of MySQL and MongoDB aggregating and analyzing data against a large, real-world data set. From this talk, you will learn when MongoDB is faster than MySQL, why that's the case, and that doctors appear to do some very questionable things.
Deploying any software can be a challenge if you don't understand how resources are used or how to plan for the capacity of your systems. Whether you need to deploy or grow a single MongoDB instance, a replica set, or tens of sharded clusters, you probably share the same challenges in trying to size that deployment.
This webinar will cover what resources MongoDB uses, and how to plan for their use in your deployment. Topics covered will include understanding how to model and plan capacity needs for new and growing deployments. The goal of this webinar will be to provide you with the tools needed to be successful in managing your MongoDB capacity planning tasks.
Time Series Databases for IoT (On-premises and Azure) (Ivo Andreev)
This document discusses choosing the right time series database for IoT data. It compares InfluxDB to SQL Server and other databases.
Some key points made:
- InfluxDB outperforms SQL Server for writes by 40x and queries by 59x for time series data due to its optimized design.
- InfluxDB uses 19x-26x less disk storage than SQL Server for the same data.
- InfluxDB also outperforms MongoDB, Elasticsearch, OpenTSDB, and Cassandra for time series workloads.
- Azure Stream Insights is a managed service but has limited capabilities and can be pricey for high volumes of data.
- InfluxDB is open source and has no dependencies.
Speaker: Akira Kurogane, Senior Technical Services Engineer, MongoDB
Level: 300 (Advanced)
Track: Performance
One week your active dataset consumes 90% of available RAM. The next week it's 110%. Is that a 10% or a 99% performance degradation? Let's discover what it looks like when different hardware capacity limits are hit: memory vs. disk bottlenecks, the rare CPU bottleneck, and network bottlenecks; what happens when you drop a crucial index during peak load; and what happens when you run multiple WiredTiger nodes on the same server without limiting their cache size.
What You Will Learn:
- Performance analysis
- Post-mortem log analysis
- Capacity planning
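The multiple-nodes-on-one-server warning above follows directly from how the default cache is sized. MongoDB documents the default WiredTiger cache as the larger of 50% of (RAM minus 1 GB) or 256 MB; the sketch below applies that formula, with the 16 GB host being an invented example.

```python
def default_wt_cache_gb(ram_gb: float) -> float:
    """MongoDB's documented default WiredTiger cache size:
    the larger of 50% of (RAM - 1 GB) or 256 MB (~0.25 GB here)."""
    return max(0.5 * (ram_gb - 1.0), 0.25)

# On a 16 GB host, a single mongod defaults to a 7.5 GB cache.
assert default_wt_cache_gb(16) == 7.5

# Three mongods on that same host, each using the default, would together
# claim 22.5 GB of cache on a 16 GB box and push it into swap or an
# OOM kill. Hence the advice to cap each node's cache explicitly via
# storage.wiredTiger.engineConfig.cacheSizeGB.
assert 3 * default_wt_cache_gb(16) == 22.5
```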
The Care + Feeding of a MongoDB Cluster (Chris Henry)
This document summarizes best practices for scaling MongoDB deployments. It discusses Behance's use of MongoDB for their activity feed, including moving from 40 nodes with 250M documents on ext3 to 60 nodes with 400M documents on ext4. It covers topics like sharding, replica sets, indexing, maintenance, and hardware considerations for large MongoDB clusters.
This presentation was given at the LDS Tech SORT Conference 2011 in Salt Lake City. The slides are quite comprehensive covering many topics on MongoDB. Rather than a traditional presentation, this was presented as more of a Q & A session. Topics covered include. Introduction to MongoDB, Use Cases, Schema design, High availability (replication) and Horizontal Scaling (sharding).
In-memory Caching in HDFS: Lower Latency, Same Great Taste (DataWorks Summit)
This document discusses in-memory caching in HDFS to improve query latency. The implementation caches important datasets in the DataNode memory and allows clients to directly access cached blocks via zero-copy reads without checksum verification. Evaluation shows the zero-copy reads approach provides significant performance gains over short-circuit and TCP reads for both microbenchmarks and Impala queries, with speedups of up to 7x when the working set fits in memory. MapReduce jobs see more modest gains as they are often not I/O bound.
The document discusses sizing a MongoDB cluster for a large coffee chain called PlanetDollar. It describes collecting mobile app performance data, including 2 years of historical event data with 3000-5000 events per second. The key steps to size the MongoDB cluster are: 1) calculate the collection and index sizes based on the amount of data, 2) estimate the working set size based on frequently accessed data, 3) use a simplified model to estimate IOPS requirements and adjust based on factors like working sets, and 4) calculate the number of shards needed based on storage, memory and IOPS requirements.
Jay Runkel presented a methodology for sizing MongoDB clusters to meet the requirements of an application. The key steps are: 1) Analyze data size and index size, 2) Estimate the working set based on frequently accessed data, 3) Use a simplified model to estimate IOPS and adjust for real-world factors, 4) Calculate the number of shards needed based on storage, memory and IOPS requirements. He demonstrated this process for an application that collects mobile events, requiring a cluster that can store over 200 billion documents with 50,000 IOPS.
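The four-step methodology above reduces to arithmetic once per-shard limits are assumed. The sketch below is a back-of-the-envelope version of it; the document count echoes the talk's 200 billion figure, but every other number (document size, index overhead, working-set fraction, per-shard limits) is an invented assumption.

```python
import math

# Back-of-the-envelope shard count following the sizing steps above.
# All constants except doc_count are illustrative assumptions.
doc_count = 200_000_000_000     # documents to store (from the talk)
avg_doc_bytes = 250             # assumed average document size
index_overhead = 0.30           # assumed index size as a fraction of data
working_set_fraction = 0.05     # assumed hot fraction that must fit in RAM
required_iops = 50_000          # peak operations per second (from the talk)

per_shard_storage_tb = 4.0      # assumed usable storage per shard
per_shard_ram_gb = 128.0        # assumed RAM per shard
per_shard_iops = 10_000         # assumed sustained IOPS per shard

# Step 1: data + index size. Step 2: working set.
data_tb = doc_count * avg_doc_bytes * (1 + index_overhead) / 1e12
working_set_gb = data_tb * working_set_fraction * 1000

# Steps 3-4: shards needed per resource; the tightest constraint wins.
shards_for_storage = math.ceil(data_tb / per_shard_storage_tb)
shards_for_ram = math.ceil(working_set_gb / per_shard_ram_gb)
shards_for_iops = math.ceil(required_iops / per_shard_iops)
shards = max(shards_for_storage, shards_for_ram, shards_for_iops)
```

Under these assumptions RAM for the working set, not storage or IOPS, is the binding constraint, which is a common outcome of this kind of sizing exercise.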
MongoDB is a document-oriented NoSQL database that uses flexible schemas and provides high performance, high availability, and easy scalability. It uses either MMAP or WiredTiger storage engines and supports features like sharding, aggregation pipelines, geospatial indexing, and GridFS for large files. While MongoDB has better performance than Cassandra or Couchbase according to benchmarks, it has limitations such as a single-threaded aggregation and lack of joins across collections.
This document discusses how to achieve scale with MongoDB. It covers optimization tips like schema design, indexing, and monitoring. Vertical scaling involves upgrading hardware like RAM and SSDs. Horizontal scaling involves adding shards to distribute load. The document also discusses how MongoDB scales for large customers through examples of deployments handling high throughput and large datasets.
Sizing MongoDB on AWS with WiredTiger, Patrick and Vigyan (Vigyan Jain)
This document provides guidance on sizing MongoDB deployments on AWS for optimal performance. It discusses key considerations for capacity planning like testing workloads, measuring performance, and adjusting over time. Different AWS services like compute-optimized instances and storage options like EBS are reviewed. Best practices for WiredTiger like sizing cache, effects of compression and encryption, and monitoring tools are covered. The document emphasizes starting simply and scaling based on business needs and workload profiling.
This document provides an introduction and agenda for a presentation on MongoDB 2.4 and Spring Data. The presentation will include a quick introduction to NoSQL and MongoDB, an overview of Spring Data's MongoDB support including configuration, templates, repositories and queries, and details on metadata mapping, aggregation functions, GridFS file storage and indexes in MongoDB.
Scaling with sync_replication using Galera and EC2 (Marco Tusa)
Challenging architecture design and proof of concept on a real case study using a synchronous replication solution.
A customer asked me to investigate and design a MySQL architecture to support their application serving shops around the globe.
The architecture must scale out and scale in based on sales seasons.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true to less widespread, but often very useful NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
MongoDB stores data in files on disk that are broken into variable-sized extents containing documents. These extents, as well as separate index structures, are memory mapped by the operating system for efficient read/write. A write-ahead journal is used to provide durability and prevent data corruption after crashes by logging operations before writing to the data files. The journal increases write performance by 5-30% but can be optimized using a separate drive. Data fragmentation over time can be addressed using the compact command or adjusting the schema.
These are the slides I presented at the Nosql Night in Boston on Nov 4, 2014. The slides were adapted from a presentation given by Steve Francia in 2011. Original slide deck can be found here:
http://spf13.com/presentation/mongodb-sort-conference-2011
MongoDb is a document oriented database and very flexible one as it gives horizontal scalability.
In this presentation basic study about mongodb with installation steps and basic commands are described.
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
This presentation discusses migrating data from other data stores to MongoDB Atlas. It begins by explaining why MongoDB and Atlas are good choices for data management. Several preparation steps are covered, including sizing the target Atlas cluster, increasing the source oplog, and testing connectivity. Live migration, mongomirror, and dump/restore options are presented for migrating between replicasets or sharded clusters. Post-migration steps like monitoring and backups are also discussed. Finally, migrating from other data stores like AWS DocumentDB, Azure CosmosDB, DynamoDB, and relational databases are briefly covered.
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB
MongoDB Kubernetes operator and MongoDB Open Service Broker are ready for production operations. Learn about how MongoDB can be used with the most popular container orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications. A demo will show you how easy it is to enable MongoDB clusters as an External Service using the Open Service Broker API for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB
Are you new to schema design for MongoDB, or are you looking for a more complete or agile process than what you are following currently? In this talk, we will guide you through the phases of a flexible methodology that you can apply to projects ranging from small to large with very demanding requirements.
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB
Humana, like many companies, is tackling the challenge of creating real-time insights from data that is diverse and rapidly changing. This is our journey of how we used MongoDB to combined traditional batch approaches with streaming technologies to provide continues alerting capabilities from real-time data streams.
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB
Time series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real time systems, the efficient capture and analysis of time series data can enable organizations to better detect and respond to events ahead of their competitors or to improve operational efficiency to reduce cost and risk. Working with time series data is often different from regular application data, and there are best practices you should observe.
This talk covers:
Common components of an IoT solution
The challenges involved with managing time-series data in IoT applications
Different schema designs, and how these affect memory and disk utilization – two critical factors in application performance.
How to query, analyze and present IoT time-series data using MongoDB Compass and MongoDB Charts
At the end of the session, you will have a better understanding of key best practices in managing IoT time-series data with MongoDB.
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB
Our clients have unique use cases and data patterns that mandate the choice of a particular strategy. To implement these strategies, it is mandatory that we unlearn a lot of relational concepts while designing and rapidly developing efficient applications on NoSQL. In this session, we will talk about some of our client use cases, the strategies we have adopted, and the features of MongoDB that assisted in implementing these strategies.
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB
Encryption is not a new concept to MongoDB. Encryption may occur in-transit (with TLS) and at-rest (with the encrypted storage engine). But MongoDB 4.2 introduces support for Client Side Encryption, ensuring the most sensitive data is encrypted before ever leaving the client application. Even full access to your MongoDB servers is not enough to decrypt this data. And better yet, Client Side Encryption can be enabled at the "flick of a switch".
This session covers using Client Side Encryption in your applications. This includes the necessary setup, how to encrypt data without sacrificing queryability, and what trade-offs to expect.
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB
MongoDB Kubernetes operator is ready for prime-time. Learn about how MongoDB can be used with most popular orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications.
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB
When you need to model data, is your first instinct to start breaking it down into rows and columns? Mine used to be too. When you want to develop apps in a modern, agile way, NoSQL databases can be the best option. Come to this talk to learn how to take advantage of all that NoSQL databases have to offer and discover the benefits of changing your mindset from the legacy, tabular way of modeling data. We’ll compare and contrast the terms and concepts in SQL databases and MongoDB, explain the benefits of using MongoDB compared to SQL databases, and walk through data modeling basics so you feel confident as you begin using MongoDB.
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB
The document discusses guidelines for ordering fields in compound indexes to optimize query performance. It recommends the E-S-R approach: placing equality fields first, followed by sort fields, and range fields last. This allows indexes to leverage equality matches, provide non-blocking sorts, and minimize scanning. Examples show how indexes ordered by these guidelines can support queries more efficiently by narrowing the search bounds.
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB
The document describes a methodology for data modeling with MongoDB. It begins by recognizing the differences between document and tabular databases, then outlines a three step methodology: 1) describe the workload by listing queries, 2) identify and model relationships between entities, and 3) apply relevant patterns when modeling for MongoDB. The document uses examples around modeling a coffee shop franchise to illustrate modeling approaches and techniques.
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
MongoDB Atlas Data Lake is a new service offered by MongoDB Atlas. Many organizations store long term, archival data in cost-effective storage like S3, GCP, and Azure Blobs. However, many of them do not have robust systems or tools to effectively utilize large amounts of data to inform decision making. MongoDB Atlas Data Lake is a service allowing organizations to analyze their long-term data to discover a wealth of information about their business.
This session will take a deep dive into the features that are currently available in MongoDB Atlas Data Lake and how they are implemented. In addition, we'll discuss future plans and opportunities and offer ample Q&A time with the engineers on the project.
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB
Virtual assistants are becoming the new norm when it comes to daily life, with Amazon’s Alexa being the leader in the space. As a developer, not only do you need to make web and mobile compliant applications, but you need to be able to support virtual assistants like Alexa. However, the process isn’t quite the same between the platforms.
How do you handle requests? Where do you store your data and work with it to create meaningful responses with little delay? How much of your code needs to change between platforms?
In this session we’ll see how to design and develop applications known as Skills for Amazon Alexa powered devices using the Go programming language and MongoDB.
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB
aux Core Data, appréciée par des centaines de milliers de développeurs. Apprenez ce qui rend Realm spécial et comment il peut être utilisé pour créer de meilleures applications plus rapidement.
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB
Il n’a jamais été aussi facile de commander en ligne et de se faire livrer en moins de 48h très souvent gratuitement. Cette simplicité d’usage cache un marché complexe de plus de 8000 milliards de $.
La data est bien connu du monde de la Supply Chain (itinéraires, informations sur les marchandises, douanes,…), mais la valeur de ces données opérationnelles reste peu exploitée. En alliant expertise métier et Data Science, Upply redéfinit les fondamentaux de la Supply Chain en proposant à chacun des acteurs de surmonter la volatilité et l’inefficacité du marché.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid
IBM watsonx Code Assistant for Z, our latest Generative AI-assisted mainframe application modernization solution. Mainframe (IBM Z) application modernization is a topic that every mainframe client is addressing to various degrees today, driven largely from digital transformation. With generative AI comes the opportunity to reimagine the mainframe application modernization experience. Infusing generative AI will enable speed and trust, help de-risk, and lower total costs associated with heavy-lifting application modernization initiatives. This document provides an overview of the IBM watsonx Code Assistant for Z which uses the power of generative AI to make it easier for developers to selectively modernize COBOL business services while maintaining mainframe qualities of service.
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...The Third Creative Media
"Navigating Invideo: A Comprehensive Guide" is an essential resource for anyone looking to master Invideo, an AI-powered video creation tool. This guide provides step-by-step instructions, helpful tips, and comparisons with other AI video creators. Whether you're a beginner or an experienced video editor, you'll find valuable insights to enhance your video projects and bring your creative ideas to life.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in a quick time span? This eSign functionality of XfilesPro DocuPrime has many advancements to offer for Salesforce users. Explore them now!
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Drona Infotech is a premier mobile app development company in Noida, providing cutting-edge solutions for businesses.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Project Management: The Role of Project Dashboards.pdfKarya Keeper
Project management is a crucial aspect of any organization, ensuring that projects are completed efficiently and effectively. One of the key tools used in project management is the project dashboard, which provides a comprehensive view of project progress and performance. In this article, we will explore the role of project dashboards in project management, highlighting their key features and benefits.
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
Engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
Malibou Pitch Deck For Its €3M Seed Roundsjcobrien
French start-up Malibou raised a €3 million Seed Round to develop its payroll and human resources
management platform for VSEs and SMEs. The financing round was led by investors Breega, Y Combinator, and FCVC.
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Paul Brebner
Closing talk for the Performance Engineering track at Community Over Code EU (Bratislava, Slovakia, June 5 2024) https://eu.communityovercode.org/sessions/2024/why-apache-kafka-clusters-are-like-galaxies-and-other-cosmic-kafka-quandaries-explored/ Instaclustr (now part of NetApp) manages 100s of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of Performance Engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs. vertical scalability, and predicting Kafka performance using metrics, modelling and regression techniques. These insights are relevant to Kafka developers and operators.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
How Can Hiring A Mobile App Development Company Help Your Business Grow?ToXSL Technologies
ToXSL Technologies is an award-winning Mobile App Development Company in Dubai that helps businesses reshape their digital possibilities with custom app services. As a top app development company in Dubai, we offer highly engaging iOS & Android app solutions. https://rb.gy/necdnt
E-commerce Development Services- Hornet DynamicsHornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
WWDC 2024 Keynote Review: For CocoaCoders AustinPatrick Weigel
Overview of WWDC 2024 Keynote Address.
Covers: Apple Intelligence, iOS18, macOS Sequoia, iPadOS, watchOS, visionOS, and Apple TV+.
Understandable dialogue on Apple TV+
On-device app controlling AI.
Access to ChatGPT with a guest appearance by Chief Data Thief Sam Altman!
App Locking! iPhone Mirroring! And a Calculator!!
2. The Aggregation Framework
• What is it?
• When should I use it?
• What can it do and not do?
• When should I use it instead of an RDBMS?
3. What is the Aggregation Framework
• It’s a data transformation pipeline.
• It's ultimately a Turing-complete functional language.
• It's SELECT, AS, JOIN, GROUP BY, and HAVING.
• It’s a fun challenge to use.
4. What can it do? and not do?
• It can read and examine documents and apply logic to them and
create new ones.
• Technically – it can do almost anything.
• Mine Bitcoins.
• Learn (in the AI sense).
• Emulate / Transpile SQL statements.
• Generate graphics.
• Run simulations.
• It can’t currently edit existing data in place.
5. When should I definitely use it?
• When the data’s in MongoDB and you don’t want to copy it.
• When you want to report on live data.
• When your application operations require more than find()
6. When should I use it versus my RDBMS?
• That’s a very good question.
8. Let us take a scenario
• You have a set of data
• You want to Report on it and Analyze it
• This data isn’t live – so we don’t need to worry about that.
• There may be a lot of it.
10. Data Details
• Every medical practice in England
• 10+ years available month by month
• Quantity and cost of each item prescribed and number of scripts.
• 100+ million rows a year
13. The Hardware
• Centos 7 – on Amazon EC2
• 32GB RAM
• 4 CPU Cores
• Databases on 2000 IOPS 400GB Disk
• Temp files on 1200 IOPS 400GB Disk
14. In the Blue corner
• MySQL Version 8.0
• Out of the box defaults
• Cache (innodb_buffer_pool) set to 80% of RAM
• 3 Tables (13GB)
• Indexing as required
15. In the Green Corner
• MongoDB 4.0.3
• Cache set to default (50% of RAM minus 1 GB)
• 1 Collection
• 15 GB of BSON
• 5GB on disk due to Snappy.
17. How much did the UK Spend in 2017?
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
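The nested $sum can be confusing: the inner $sum collapses the prescriptions array inside each document, while the outer $sum is the $group accumulator that adds those per-document totals across the whole (single) group. A minimal plain-JavaScript sketch of the same two-level sum, with a document shape assumed from the slides:

```javascript
// Toy documents shaped as the slides imply: each holds an array of
// prescription items with a cost.
const docs = [
  { prescriptions: [{ cost: 2.5 }, { cost: 4.0 }] },
  { prescriptions: [{ cost: 5.0 }] },
];

// Inner $sum: collapse each document's array to a single number.
const perDoc = docs.map(d =>
  d.prescriptions.reduce((acc, p) => acc + p.cost, 0));

// Outer $sum: the $group accumulator, one big group (_id: "all").
const total = perDoc.reduce((acc, t) => acc + t, 0);
// total === 11.5
```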
20. The other Result
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL: 37 seconds, 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 54 seconds, 25% CPU, 0 IOPS, 0 MB/s
21. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Depth Matters
• More flexibility
1 Bob 3.5 18-7-1972 NULL
2 Sally 8.9 15-3-1984 "Magic"
22. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
_id:int 1 name: str(3) "bob" size:double 3.5 when: date 18-7-1972
23. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Hierarchy Matters
• Much more flexibility
_id:int 1 name: str(3) "bob" sizes: array(256) [
double 3.5,
double 10,
double 1.2,
double 99]
when: date 18-7-1972
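The contrast on these slides can be sketched in a few lines: a fixed-width RDBMS row lets the engine jump straight to a column by offset, while a BSON document is walked element by element from the front, using each element's stored size to hop to the next. The field names and byte lengths below are illustrative only, not real BSON encoding:

```javascript
// Illustrative element list: [field name, encoded byte length] pairs,
// mimicking the slide's document (note the 256-byte nested array).
const elements = [
  ["_id", 4], ["name", 7], ["sizes", 256], ["when", 8],
];

// Finding a field means traversing from the start; the "known sizes"
// let us skip whole elements (even a nested array) without decoding them.
function offsetOf(field) {
  let off = 0;
  for (const [name, len] of elements) {
    if (name === field) return off;
    off += len;
  }
  return -1; // an absent field costs a full traversal
}
// offsetOf("when") hops past 4 + 7 + 256 = 267 bytes
```

This is why the slides say hierarchy and depth matter: fields late in a deep document cost more to reach than early ones.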
24. What about an Index?
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL (covering index applied): 21 seconds (vs. 37), 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 54 seconds, 25% CPU, 0 IOPS, 0 MB/s
25. But can’t MongoDB index cover too?
• Yes, but not when it’s a Multikey (array) index
• We index each unique value only once
• So the index cannot recreate the array
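A toy illustration of why a multikey index cannot cover a query: the index keys each distinct value once, in key order, so the original array (its order and any duplicates) is unrecoverable from the index alone. The values are made up:

```javascript
const arr = [3.5, 10, 1.2, 10];          // the array stored in the document

// A multikey index stores one key per distinct value, in sorted order.
const indexKeys = [...new Set(arr)].sort((a, b) => a - b);
// indexKeys = [1.2, 3.5, 10]: the duplicate 10 and the original order are
// gone, so the server must fetch the document itself to return the array.
```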
26. Can we fix that?
• What if we flatten the data?
• Lots of redundancy
• Collection is now 200% larger
• Normalisation?
db.prescriptions.aggregate([
{$unwind: "$prescriptions"},
{$project: {_id: 0}},
{$out: "tabular"}
])
db.tabular.createIndex({"prescriptions.cost": 1})
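What $unwind does to each document can be modelled in plain JavaScript: one output document per array element, with every other field duplicated into each one, which is exactly where the redundancy and ~200% growth mentioned here come from. Field names follow the slides; the values are invented:

```javascript
const doc = {
  practice: "A81001", period: 201701,
  prescriptions: [{ cost: 2.5 }, { cost: 4.0 }],
};

// Minimal model of {$unwind: "$prescriptions"}: replace the array with
// each element in turn, copying the rest of the document every time.
function unwind(d, field) {
  return d[field].map(elem => ({ ...d, [field]: elem }));
}

const rows = unwind(doc, "prescriptions");
// rows.length === 2, and "practice"/"period" are repeated in both rows.
```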
27. Flat, wide data.
• 110 M Rows
• 51 GB as BSON
• 15 GB Compressed
• Not really tabular.
• 860 MB Index
• Prefix Compression
• Super space efficient
28. Query Performance when flattened.
MongoDB, no covering index: 509 seconds (vs 54), 6% CPU, 1700 IOPS, 30 MB/s
MongoDB, with covering index: 509 seconds, 6% CPU, 1700 IOPS, 30 MB/s
29. Query Performance when flattened.
• That doesn’t look right.
MongoDB, no index: 509 seconds (vs 54), 6% CPU, 1700 IOPS, 30 MB/s
MongoDB, with index: 509 seconds, 6% CPU, 1700 IOPS, 30 MB/s
"queryPlanner" : {
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}}
30. Flat, wide data.
• Need to persuade aggregation to use the index
• Add a query (cost > 0) or sort by cost at the start
• Still slower than the document model?
• Document model is efficient.
• This data is actually MOST of the database 110M Entries
• Imagine if our index was a small percentage of the data.
• Index compression has a cost when reading.
No index: 509 seconds (vs 54), 6% CPU, 1700 IOPS, 30 MB/s
With index: 177 seconds (vs 54), 25% CPU, 0 IOPS, 0 MB/s
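The workaround this slide describes can be sketched as a pipeline shape (not a tested server run): a $match (or $sort) on the indexed field at the head of the pipeline makes an index scan eligible, instead of the COLLSCAN seen in the previous explain output. The collection and field names follow the flattened "tabular" collection built earlier:

```javascript
// A leading $match on the indexed field lets the planner choose the
// {"prescriptions.cost": 1} index rather than a collection scan.
const pipeline = [
  { $match: { "prescriptions.cost": { $gt: 0 } } }, // makes index eligible
  { $group: { _id: "all", t: { $sum: "$prescriptions.cost" } } },
];
// In the shell: db.tabular.aggregate(pipeline)
```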
31. Table Layout
RDBMS
• Lots of fixed size rows in a file
• Nice predictable layout
MongoDB
• Variable Length rows in a file
• Less predictable layout
32. Table Layout – The Truth
• RDBMS and MongoDB both store records in Trees
• Records are in some ways, just like indexes.
33. Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Index on identity
• Can only walk the tree
• Slower to collection scan
• Less lock contention
34. Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Organised by Identity (int64)
• No links between leaves
• Slower to scan everything
• Much less lock contention
35. In-Document rollup.
• We have multiple data items in each document.
• Add summaries of cost in each document?
• Nearly free to maintain when updating anyway: $max, $min, $sum, $count
• RDBMS equivalent has big cost.
• You need to know in advance, or add as needed
• Like an RDBMS index
• What if we index the in-document rollup?
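The idea can be sketched as a write that maintains the rollup in the same operation that adds an item. The field name totalCost and the shell call in the comment are hypothetical, not from the talk:

```javascript
// Shell shape (hypothetical field names):
//   db.prescriptions.updateOne({ _id: id },
//     { $push: { prescriptions: item }, $inc: { totalCost: item.cost } })
// Plain-JS model of the same idea:
function addItem(doc, item) {
  doc.prescriptions.push(item);
  doc.totalCost = (doc.totalCost || 0) + item.cost; // rollup stays current
  return doc;
}

const doc = { prescriptions: [], totalCost: 0 };
addItem(doc, { cost: 2.5 });
addItem(doc, { cost: 4.0 });
// Reporting can now $sum (or index) "$totalCost" directly, without
// touching the prescriptions array at all.
```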
36. MongoDB with in document roll-up.
No index on IDI: 18 seconds (vs. 54, or 21 in the RDBMS), 25% CPU, 0 IOPS, 0 MB/s
Index on IDI: 18 seconds, 25% CPU, 0 IOPS, 0 MB/s
37. MongoDB with in document roll-up.
No index on IDI: 18 seconds (vs. 54, or 21 in the RDBMS), 25% CPU, 0 IOPS, 0 MB/s
Index on IDI: 0.01 seconds, 25% CPU, 0 IOPS, 0 MB/s
38. So far…
• When Data fits in RAM
• RDBMS Table scan faster than Mongo Collection scan
• RDBMS Index scan faster than RDBMS Table scan
• Large MongoDB Index scan isn't the solution
• In document rollups beat RDBMS Index scan
• Index scan of in-document rollups is really quick.
40. What if it wasn’t all about the CPU?
• Data Lakes and “Big Data”
• Limited by reading data from disk
• Limited by Parallelism
• New Experiment Time.
• Reduce RAM to much less than Data Size*
• Work with disk bound data.
• Still one CPU.
41. Table/Collection scan from Disk
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL: 157 seconds, 3.5% CPU, 1253 IOPS, 103 MB/s
MongoDB: 61 seconds, 25% CPU, 650 IOPS, 103 MB/s
42. Why is MongoDB faster
MySQL
• Data Size = 15 GB
MongoDB
• Data Size = 5GB*
• Minimal decompression overhead
*Not 'Big' Data
43. Index scan from Disk
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL (index added): 41 seconds, 25% CPU, 1020 IOPS, 103 MB/s
MongoDB: 61 seconds, 25% CPU, 650 IOPS, 103 MB/s
45. More Complex Queries
From RAM
• May still use disk for temp tables, storage etc.
• All tables and indexes fit in RAM
From DISK
• Data does NOT fit in RAM
• Some indexes MAY be in RAM
• No indexes used for MongoDB
46. With Group BY (RAM)
select sum(cost)
from prescriptions
group by period;
{ $group : {
_id: "$period",
t : {$sum :
{ $sum:
"$prescriptions.cost"}
}}}
MySQL: 63 seconds, 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 60 seconds, 25% CPU, 0 IOPS, 0 MB/s
With index: 30 seconds, 25% CPU, 0 IOPS, 0 MB/s
47. With Group BY (Disk)
select sum(cost)
from prescriptions
group by period;
{ $group : {
_id: "$period",
t : {$sum :
{ $sum:
"$prescriptions.cost"}
}}}
With index: 39 seconds, 18% CPU, 1010 IOPS, 103 MB/s
MySQL: 160 seconds, 10% CPU, 1020 IOPS, 103 MB/s
MongoDB: 63 seconds, 24% CPU, 650 IOPS, 82 MB/s
48. Top 10 practices by spend (RAM)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
{ $group: { _id: "$practice",
spend: { $sum:
{ $sum:
"$prescriptions.cost"}}}},
{ $sort: { spend: -1}},
{ $limit: 10}
]
With index: 51 seconds, 18% CPU, 1010 IOPS, 103 MB/s
MySQL: 75 seconds, 10% CPU, 1020 IOPS, 103 MB/s
MongoDB: 63 seconds, 24% CPU, 650 IOPS, 82 MB/s
49. Top 10 practices by spend (Disk)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
{ $group: { _id: "$practice",
spend: { $sum:
{ $sum:
"$prescriptions.cost"}}}},
{ $sort: { spend: -1}},
{ $limit: 10}
]
MySQL: 160 seconds, 10% CPU, 1150 IOPS, 104 MB/s
MongoDB: 64 seconds, 25% CPU, 650 IOPS, 82 MB/s
With index: 55 seconds, 21% CPU, 724 IOPS, 77 MB/s
50. £ per patient – JOIN and Group (RAM)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id": "$practice",
"perpatient": {"$sum":
{"$divide":
[{"$sum": "$prescriptions.cost"},
"$numpatients"]
}
}}},
{ "$sort": {"perpatient": -1}},
{ "$limit": 10}
MySQL: 105 seconds, 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 62 seconds, 25% CPU, 0 IOPS, 0 MB/s
51. £ per patient – JOIN and Group (Disk)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id": "$practice",
"perpatient": {"$sum":
{"$divide":
[{"$sum": "$prescriptions.cost"},
"$numpatients"]
}
}}},
{ "$sort": {"perpatient": -1}},
{ "$limit": 10}
MySQL: 160 seconds, 17% CPU, 1200 IOPS, 103 MB/s
MongoDB: 62 seconds, 24% CPU, 650 IOPS, 82 MB/s
52. £/patient/county – nested SELECT (RAM)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
{"$group" : {"_id" : { "county": "$address.county",
"practice": "$practice"}, "spend" : { "$sum" : {"$sum" :
"$prescriptions.cost"}}, "numpatients" : { "$avg" :
"$numpatients"}}},
{ "$group": { "_id" : "$_id.county", "spend" :{ "$sum" :
"$spend" }, "patients" : {"$sum": "$numpatients"}}},
{"$addFields" : { "costperpatient" : { "$divide" :
["$spend","$patients"] }}},
{"$match" : { "patients" : { "$gt" : 100000}}},
{"$sort" : { "costperpatient" : -1}},
{"$limit": 20} ])
MySQL: 160 seconds, 24% CPU, 0 IOPS, 0 MB/s
MongoDB: 66 seconds, 24% CPU, 0 IOPS, 0 MB/s
53. Result
County Spend (£M) Patients Per Patient (£)
LINCOLNSHIRE 122 699309 175
WIRRAL 25 149554 172
CO DURHAM 102 596638 171
CLEVELAND 75 462593 163
ISLE OF WIGHT 25 144555 162
54. Result
County Spend (£M) Patients Per Patient (£)
LINCOLNSHIRE 122 699309 175
WIRRAL 25 149554 172
CO DURHAM 102 596638 171
CLEVELAND 75 462593 163
ISLE OF WIGHT 25 144555 162
County Spend (£M) Patients Per Patient (£)
BERKSHIRE 102 944538 108
MIDDLESEX 150 1469189 102
BRISTOL 14 145660 97
LONDON 522 5672564 92
LEEDS 9 122785 74
55. £/patient/county – nested SELECT (Disk)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
  {"$group": {"_id": { "county": "$address.county",
      "practice": "$practice"},
    "spend": {"$sum": {"$sum": "$prescriptions.cost"}},
    "numpatients": {"$avg": "$numpatients"}}},
  {"$group": {"_id": "$_id.county",
    "spend": {"$sum": "$spend"},
    "patients": {"$sum": "$numpatients"}}},
  {"$addFields": {"costperpatient": {"$divide": ["$spend", "$patients"]}}},
  {"$match": {"patients": {"$gt": 100000}}},
  {"$sort": {"costperpatient": -1}},
  {"$limit": 20}])
SQL:     220 seconds  (21% CPU, 700 IOPS, 68 MB/s)
MongoDB:  67 seconds  (23% CPU, 640 IOPS, 82 MB/s)
56. Most common drugs – REGROUP (RAM)
select bnfcode,max(name), sum(nitems) as items
from nhs.prescriptions
group
by bnfcode
order by items desc
limit 10;
{ "$unwind": "$prescriptions"},
{ "$group": {
    "_id": "$prescriptions.bnfcode",
    "name": {"$max": "$prescriptions.name"},
    "items": {"$sum": "$prescriptions.nitems"}}},
{ "$sort": {"items": -1}},
{ "$limit": 10}]
SQL:           300 seconds  (23% CPU, 0 IOPS, 0 MB/s)
SQL (indexed): 126 seconds  (25% CPU, 0 IOPS, 0 MB/s)
MongoDB:       262 seconds  (25% CPU, 0 IOPS, 0 MB/s)
57. Grouping Techniques
SQL
• Can take advantage of index ordering by the group key: all items with
  the same key come together, so it can process one group at a time.
  1,1,1,1,1,2,2,2,2,3,3
• Uses a temp table and a sort when it can't.
MongoDB
• Does not take advantage of ordering (yet) – maintains a data structure
  with all values.
• Assumes you will want to group again further down the pipeline, so it
  is optimised for that.
• Builds a tree (using disk) for the values.
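The two strategies above can be sketched in plain JavaScript (an illustrative sketch only; neither engine is implemented this way, and the function names are mine):

```javascript
// Streaming group-by (SQL with an index on the group key): input
// arrives sorted, so a group is complete the moment the key changes.
// Only one running total is in memory at any time.
function streamingGroupSum(sortedPairs) {
  const out = [];
  let currentKey = null;
  let total = 0;
  for (const [key, value] of sortedPairs) {
    if (key !== currentKey) {
      if (currentKey !== null) out.push([currentKey, total]);
      currentKey = key;
      total = 0;
    }
    total += value;
  }
  if (currentKey !== null) out.push([currentKey, total]);
  return out;
}

// Hash-style group-by (MongoDB's $group): input order is irrelevant,
// but a running total for every distinct key is held at once - the
// structure that spills to disk when it outgrows memory.
function hashGroupSum(pairs) {
  const totals = new Map();
  for (const [key, value] of pairs) {
    totals.set(key, (totals.get(key) ?? 0) + value);
  }
  return [...totals.entries()];
}
```

The streaming version depends on the 1,1,1,2,2,3 ordering shown above; the hash version pays for its order-independence in memory.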
59. Most common drugs – REGROUP (Disk)
select bnfcode,max(name), sum(nitems) as items
from nhs.prescriptions
group
by bnfcode
order by items desc
limit 10;
{ "$unwind": "$prescriptions"},
{ "$group": {
    "_id": "$prescriptions.bnfcode",
    "name": {"$max": "$prescriptions.name"},
    "items": {"$sum": "$prescriptions.nitems"}}},
{ "$sort": {"items": -1}},
{ "$limit": 10}]
SQL:           1427 seconds  (4% CPU, 1800 IOPS, 100 MB/s)
SQL (indexed):  192 seconds  (13% CPU, 520 IOPS, 56 MB/s)
MongoDB:        262 seconds  (24% CPU, 180 IOPS, 23 MB/s)
60. Most Expensive Drugs - Result
Drug                               Category        Spend
Rivaroxaban_Tab 20mg               Anti-coagulant  £100,007,025
Apixaban_Tab 5mg                   Anti-coagulant  £79,302,385
Fostair_Inh 100mcg/6mcg (120D) C   Asthma          £75,541,726
Tiotropium_Pdr For Inh Cap 18mcg   COPD            £66,348,167
Sitagliptin_Tab 100mg              Diabetes        £60,919,725
Symbicort_Turbohaler 200mcg/6mcg   Asthma          £44,314,887
Apixaban_Tab 2.5mg                 Anti-coagulant  £41,290,937
Ins Lantus SoloStar_100u/ml 3ml    Diabetes        £41,182,602
Ezetimibe_Tab 10mg                 Cholesterol     £40,756,234
Linagliptin_Tab 5mg                Diabetes        £38,503,893
61. Anomaly Detection – JOIN Derived (RAM)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
    "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
    "name": {"$max": "$prescriptions.name"},
    "practicename": {"$max": "$address.name"},
    "nitems": {"$sum": "$prescriptions.nitems"},
    "numpatients": {"$max": "$numpatients"},
    "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
62. Anomaly Detection – JOIN Derived (RAM)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
    "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
    "name": {"$max": "$prescriptions.name"},
    "practicename": {"$max": "$address.name"},
    "nitems": {"$sum": "$prescriptions.nitems"},
    "numpatients": {"$max": "$numpatients"},
    "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
SQL:     2250 seconds  (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1489 seconds  (25% CPU, 0 IOPS, 0 MB/s)
63. Results
Drug                   Practice               Ratio   Note
Methadone              FULCRUM                297     Rehab
Trazodone              CARE HOMES MEDICAL     242     Elderly care
Buprenorphine          FULCRUM                233
Thickenup PDR          CARE HOMES MEDICAL     174
Vitrex Nitrile Gloves  REETH MEDICAL          173     Preference?
Ema Film Gloves        NEW SPrintwells1       168
Pro D3 Cap             PALFREY HEALTH CENTRE  123     Vitamin D?
Fultium D3 Cap         MOHANTY                123     Vitamin D
65. Anomaly Detection – JOIN Derived (Disk)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
    "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
    "name": {"$max": "$prescriptions.name"},
    "practicename": {"$max": "$address.name"},
    "nitems": {"$sum": "$prescriptions.nitems"},
    "numpatients": {"$max": "$numpatients"},
    "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
SQL:     7848 seconds  (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1655 seconds  (25% CPU, 170 IOPS, 24 MB/s)
66. Conclusions
• MongoDB is faster from disk when there are no indexes
• MongoDB is generally faster for more complex queries
• MongoDB fits the data-lake model nicely.
73. BI Connector
• Total Spend By Period.
• Sum one column Group by Primary Key
(times in seconds)
       SQL Unindexed   SQL Indexed   MongoDB Aggregation   BI Connector
Disk   160             39            63                    -
RAM    63              30            60                    187
74. BI Connector
• Total Spend By PRACTICE.
• Sum one column Group by PART OF KEY
(times in seconds)
       SQL Unindexed   SQL Indexed   MongoDB Aggregation   BI Connector
Disk   160             55            62                    -
RAM    75              51            62                    230
75. BI Connector
• Total Spend By PATIENT.
• Sum one column Group BY JOINED FIELD
(times in seconds)
       SQL Indexed   MongoDB Aggregation   BI Connector
Disk   160           62                    -
RAM    105           62                    230
76. BI Connector
• AVG Spend By SPEND PER PATIENT PER COUNTY.
• Sum one column Group BY COMPUTED FIELD
(times in seconds)
       SQL Indexed   MongoDB Aggregation   BI Connector
Disk   220           67                    -
RAM    160           66                    220
77. BI Connector
• MOST prescribed drugs.
• Sum one column Group BY Not PK
(times in seconds)
       SQL Unindexed   SQL Indexed   MongoDB Aggregation   BI Connector
Disk   1427            192           262                   -
RAM    300             126           262                   280
78. BI Connector
• Anomaly Detection.
• Sum one column Group from subquery Joined table
(times in seconds)
       SQL Indexed   MongoDB Aggregation   BI Connector
Disk   7848          1655                  -
RAM    2250          1489                  D.N.F!
79. BI Connector – Did Not Finish – why?
• Anomaly Detection.
• Did Run – but was taking a long time
• Using expressive $lookup for the join
$lookup: {
  from: "prescriptions",
  let: { drug: "aspirin" },
  pipeline: [
    group total by drugname, practice,
    divide by practice size,
    group by drugname,
    match $$drug
  ],
  as: <output array field>
}
80. BI Connector – why Did Not Finish
• Anomaly Detection.
• Did run – but was taking a long time
• Using expressive $lookup for the join
• No Index on in-memory table
• Hand written version
• used temp collection
• Made use _id was lookup field
$lookup: {
  from: "prescriptions",
  let: { drug: "aspirin" },
  pipeline: [
    group total by drugname, practice,
    divide by practice size,
    group by drugname,
    match $$drug
  ],
  as: <output array field>
}
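To make the pseudocode above concrete, here is one way such an expressive $lookup could be written, as a plain JavaScript object (the field names follow the schema used throughout this talk; the actual stages the BI Connector emits are not shown in the deck, so treat this as an illustration of the shape, not the connector's real output):

```javascript
// Sketch of an expressive $lookup for one drug. Without an index the
// joined collection can use, this inner pipeline effectively re-runs
// per outer document - the behaviour that made the BI Connector DNF.
const lookupStage = {
  $lookup: {
    from: "prescriptions",
    let: { drug: "$prescriptions.name" },
    pipeline: [
      { $unwind: "$prescriptions" },
      // keep only rows for the outer document's drug
      { $match: { $expr: { $eq: ["$prescriptions.name", "$$drug"] } } },
      // per-practice total, divided by practice size
      { $group: {
          _id: "$practice",
          perPatient: { $avg: { $divide: ["$prescriptions.nitems", "$numpatients"] } } } },
      // collapse to one "typical" value for the drug
      { $group: { _id: null, aveperperson: { $avg: "$perPatient" } } }
    ],
    as: "typical"
  }
};
```

The hand-written pipeline avoided this cost by materialising the inner query once with $out and joining on _id, which every collection indexes by default.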
81. Conclusions
• BI Connector is a little slower than the RDBMS for simple queries
• The more complicated the query, the better it does relatively
• It's not as quick as a hand-crafted aggregation
• But you can put that in views
• But it's very convenient
• You can use your existing BI tooling
• And you could always use Charts instead
84. Pros
• Simpler to write much more complicated processing
• Lots of libraries of pre-written code
• Able to perform a lot of in-memory computation
• MongoDB can send them data very, very quickly
85. Cons
• Costs of transferring from or inside clouds (Atlas, AWS)
• Network speed limitations
• Additional hardware
• Security considerations
AWS same region       1 cent / GB
AWS between regions   9 cents / GB
AWS out to            11 cents / GB
86. So do I use Spa^HR^doop! Or not?
• Yes – those tools are great for many things
• But always push computation DOWN to MongoDB if you can
• There is a balance
• Amount of effort to write as a Pipeline
• Reduced network costs in time and money
87. Simple Example
• Pearson's rho
• Degree of correlation between two numeric lists
• Let's compare latitude (north vs south)
• And quantity of drug per person
• Hypothesis: "For some drugs, more is prescribed as you travel up or down the UK"
88. Step 1 - Geocoding
• We need to augment our records with Lat/Long
• Download a handy set of postcode centroids
• mongoimport --type csv --headerline -d nhs postcodes.csv
• Use $lookup and $out to attach to each record and make new collection.
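A sketch of that $lookup/$out step, assuming the imported centroids land in a `postcodes` collection with `postcode`, `lat` and `lon` fields, and that each practice document carries `address.postcode` (both are schema assumptions; the deck does not show these field names):

```javascript
// Geocoding sketch: attach a lat/lon centroid to each record and
// materialise the result as a new collection. Field names assumed.
const geocodePipeline = [
  { $lookup: {
      from: "postcodes",
      localField: "address.postcode",
      foreignField: "postcode",
      as: "geo" } },
  { $unwind: "$geo" },                      // one centroid per record
  { $addFields: { lat: "$geo.lat", lon: "$geo.lon" } },
  { $project: { geo: 0 } },                 // drop the join scaffolding
  { $out: "prescriptions_geo" }             // write the new collection
];
// db.prescriptions.aggregate(geocodePipeline)
```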
90. Step 2 – Group by drug
• For each drug compute average quantity/10,000 patients per
surgery
• Group to one record per drug with an array of objects
{
  drug: "Aspirin",
  prescribed: [
    { where: [-3.5, 55.2],
      per10k: 75.4 }
    ...
  ]
}
91. Group by BNF code
unwind ={ $unwind:"$prescriptions"}
regroup = { $group : {
_id : "$prescriptions.name",
p : { $push : { where : ["$lon","$lat"] ,
per10k : { $multiply: [10000,
{$divide : [ "$prescriptions.nitems",
"$numpatients"]}]}}}}}
92. Step 3 – Pearson's rho
• Compute rho on the array, comparing per10k to latitude.
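For reference, here is what that computation looks like if done client-side in a few lines of JavaScript, applied to the grouped documents from the previous step (function names are mine; the talk's point is that this can equally be expressed as pipeline stages):

```javascript
// Pearson's rho between two equal-length numeric arrays.
function pearsonRho(xs, ys) {
  const n = xs.length;
  const mean = a => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// For one grouped drug document of the shape shown earlier:
// { drug, prescribed: [{ where: [lon, lat], per10k }, ...] }
function latitudeCorrelation(doc) {
  const lats = doc.prescribed.map(p => p.where[1]);
  const per10k = doc.prescribed.map(p => p.per10k);
  return pearsonRho(lats, per10k);
}
```

A rho near +1 or -1 for a drug supports the hypothesis that prescribing volume tracks latitude; near 0 means no linear relationship.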
100. Conclusions
• The Aggregation Framework is fast
• There is no truth to "RDBMS is just better"
• It's a good choice for non-trivial, ad-hoc queries
• It's a good choice for large data sets
• Consider sharding and microsharding
• In a Cloud world – push work to the database
• Even with R/SAS/Spark, etc.
Editor's Notes
Intro – Shard N – Tech heavy – understandable by all
Not my usual talk – reviews: 40% too easy, 40% too hard – this is for the 20% in the middle
Clarify – what if the Live data is in MongoDB and you want to report
In place? Copy to RDBMS?
Imagine – all things being equal – you want to build a data warehouse, or a data lake, or you don't really know yet and just want to report on the data you have.
How does MongoDB compare to a more traditional approach?
Although you do have to differentiate between OLAP and warehousing etc.