This document discusses MongoDB performance tuning. It emphasizes that performance tuning is an obsession that requires planning schema design, statement tuning, and instance tuning in that order. It provides examples of using the MongoDB profiler and explain functions to analyze statements and identify tuning opportunities like non-covered indexes, unnecessary document scans, and low data locality. Instance tuning focuses on optimizing writes through fast update operations and secondary index usage, and optimizing reads by ensuring statements are tuned and data is sharded appropriately. Overall performance depends on properly tuning both reads and writes.
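The profiler/explain workflow above can be illustrated with a toy Python sketch (this is not MongoDB's implementation; the collection, field names, and counters are invented for illustration). It contrasts the `docsExamined`-style cost that `explain` would report for a full collection scan against an index-assisted lookup:

```python
import bisect

# Toy collection, sorted by "user_id" (the indexed field).
docs = [{"user_id": i, "status": "active" if i % 2 else "idle"} for i in range(1000)]

def collection_scan(docs, user_id):
    """Examine every document, as a query with no usable index must."""
    examined, hits = 0, []
    for d in docs:
        examined += 1
        if d["user_id"] == user_id:
            hits.append(d)
    return hits, examined

def index_lookup(docs, index_keys, user_id):
    """Binary-search a sorted key list, the way a B-tree index narrows a lookup."""
    pos = bisect.bisect_left(index_keys, user_id)
    if pos < len(index_keys) and index_keys[pos] == user_id:
        return [docs[pos]], 1  # only one document examined
    return [], 0

index_keys = [d["user_id"] for d in docs]  # sorted by construction

hits_scan, examined_scan = collection_scan(docs, 742)
hits_idx, examined_idx = index_lookup(docs, index_keys, 742)
print(examined_scan, examined_idx)  # 1000 1
```

The gap between the two "documents examined" counts is exactly the kind of tuning opportunity the profiler surfaces.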
Jeremy Mikola from 10gen discussed new features in MongoDB 2.4 related to sharding. He covered shard keys, hashed shard keys, and tag-aware sharding. Shard keys are used to partition data across shards and must have certain properties. Hashed shard keys provide better distribution by hashing the shard key value. Tag-aware sharding allows associating shard key ranges with specific shards to control data location for reasons like performance or legal requirements.
Choosing a shard key can be difficult, and the factors involved largely depend on your use case. In fact, there is no such thing as a perfect shard key; there are design tradeoffs inherent in every decision. This presentation goes through those tradeoffs, as well as the different types of shard keys available in MongoDB, such as hashed and compound shard keys
MySQL Without the SQL - Oh My! -> MySQL Document Store -- Confoo.CA 2019 (Dave Stokes)
MySQL can be used as a NoSQL JSON document store as well as in its well-known role as a SQL relational database. This presentation covers why you would want to use NoSQL and JSON, and how to combine them with the relational data you already have.
Back to Basics Spanish Webinar 3 - Introducción a los replica sets (MongoDB)
How to create a production cluster
How to create a replica set
How MongoDB manages data persistence, and how a replica set recovers automatically from all kinds of failures
Mythbusting: Understanding How We Measure the Performance of MongoDB (MongoDB)
Benchmarking, benchmarking, benchmarking. We all do it; mostly it tells us what we want to hear, but it often hides a mountain of misinformation. In this talk we will walk through the pitfalls you might find yourself in, looking at some examples where things go wrong. We will then walk through how MongoDB performance is measured: the processes, the methodology, and ways to present and examine the information.
Webinar: MongoDB 2.4 Feature Demo and Q&A on Hash-based Sharding (MongoDB)
In version 2.4, MongoDB introduces hash-based sharding, allowing the user to shard on a randomized shard key to spread documents evenly across a cluster. Hash-based sharding is an alternative to range-based sharding, making it easier to manage your growing cluster. In this talk, we'll provide an overview of this new feature and discuss the pros and cons of a hash-based vs. range-based approach.
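The tradeoff can be seen in a hand-rolled Python sketch (this is not MongoDB's actual hash function or chunking logic; md5 stands in only because it is deterministic). With a monotonically increasing shard key, range partitioning sends every recent write to the last chunk, while hashing spreads the same writes out:

```python
import hashlib

NUM_CHUNKS = 4
KEY_SPACE = 1000  # keys fall in [0, 1000)

def chunk_for_range(key):
    # Range-based: contiguous key ranges map to contiguous chunks.
    return min(key * NUM_CHUNKS // KEY_SPACE, NUM_CHUNKS - 1)

def chunk_for_hash(key):
    # Hash-based: a hash of the key decides the chunk.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_CHUNKS

# A monotonically increasing shard key: the 100 most recent inserts.
recent_keys = range(900, 1000)

range_hist = [0] * NUM_CHUNKS
hash_hist = [0] * NUM_CHUNKS
for k in recent_keys:
    range_hist[chunk_for_range(k)] += 1
    hash_hist[chunk_for_hash(k)] += 1

print(range_hist)  # [0, 0, 0, 100]: every recent write hits the last chunk
print(hash_hist)   # the same writes, spread across chunks
```

The flip side, which the talk also covers, is that hashing destroys key locality, so range queries on the shard key must hit every shard.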
The document discusses MongoDB performance optimization strategies presented by Moshe Kaplan at a VP R&D Open Seminar. It covers topics like sharding, in-memory databases, MapReduce, profiling, indexes, server stats, schema design, and locking in MongoDB. Slides include information on tuning configuration parameters, analyzing profiling results, explain plans, index management, and database stats.
This document summarizes MongoDB's recent release history and highlights of upcoming releases. It outlines key features from releases 2.0 through 2.4 including journaling, sharding, indexing improvements, and the aggregation framework. Release 2.2 focused on concurrency improvements like yielding and database-level locking as well as the new aggregation framework. Future releases will focus on areas like security, geospatial indexing, and distributed non-sharded collections. Community involvement through testing, feedback, and local user groups is encouraged.
I inherited a MongoDB database server with 60 collections and 100 or so indexes.
The business users are complaining about slow report completion times. What can I do to improve performance?
1) MongoDB databases can grow very large due to flexible document schemas that allow large and denormalized data. This leads to increased storage requirements.
2) MongoDB replication can introduce lag on secondary nodes as they process write operations. This limits the ability to use secondary nodes for scaling reads.
3) MongoDB performance declines dramatically when indexes do not fit in memory, requiring more RAM, sharding, or reduced write performance.
4) MongoDB implements database-level locking, limiting write concurrency and the ability to run multiple shards on a single server.
5) MongoDB does not support ACID transactions, multi-version concurrency control, or consistent reads in the presence of concurrent writes. Workarounds exist for some of these limitations.
This document provides an overview of MongoDB basics, including:
- A history of MongoDB and how it enables working with non-structured data and real-time analytics.
- MongoDB's ranking as the highest placed non-relational database and as a "Challenger" to relational databases.
- How MongoDB works using a clustered architecture with shards, replica sets, config servers, and mongos processes to provide scalability, high availability, and load balancing.
- Key MongoDB concepts like documents, collections, embedded documents, and schema flexibility compared to a traditional SQL schema.
- MongoDB utilities for backup, restore, and monitoring like mongoexport, mongorestore, mongostat, and mongotop.
Elasticsearch is quite a common tool nowadays, usually as part of the ELK stack, but in some cases supporting the main feature of a system as its search engine. Documentation on regular use cases, and on usage in general, is pretty good, but how does it really work? How does it behave beneath the surface of the API? This talk is about that: we will look under the hood of Elasticsearch and dive deep into largely unknown implementation details. The talk covers cluster behaviour, communication with Lucene, and Lucene internals, literally down to bits and pieces. Come and see Elasticsearch dissected.
This document discusses tag-based sharding in MongoDB. It provides an overview of MongoDB clusters and components like replica sets, shards, config servers, and mongos. It also explains key concepts like chunks and migrations. Additionally, it outlines the steps to split chunks, migrate data between shards, and perform normal MongoDB operations. Finally, it covers how to implement tag-based sharding by tagging shards and defining tag ranges for chunks.
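The routing step described above can be sketched in a few lines of Python (the tags, ranges, and shard names here are invented for illustration; MongoDB itself records this metadata on the config servers via the `sh.addShardTag` and `sh.addTagRange` shell helpers):

```python
# Tag ranges over the shard key (here: a customer surname), and the
# shards carrying each tag.
tag_ranges = [
    {"min": "A", "max": "N", "tag": "EU"},
    {"min": "N", "max": "~", "tag": "US"},  # "~" sorts after all letters
]
shard_tags = {"shard0": ["EU"], "shard1": ["US"]}

def shards_for_key(key):
    """Return the shards eligible to hold the chunk containing `key`."""
    for r in tag_ranges:
        if r["min"] <= key < r["max"]:
            return [s for s, tags in shard_tags.items() if r["tag"] in tags]
    return list(shard_tags)  # chunks outside any tag range may live anywhere

print(shards_for_key("Garcia"))  # ['shard0'] (EU range)
print(shards_for_key("Smith"))   # ['shard1'] (US range)
```

In the real cluster it is the balancer, not the query router, that enforces this: it migrates any chunk sitting on a shard whose tags do not match the chunk's range.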
The document discusses key concepts related to MongoDB including its characteristics, replication, sharding, and utilities. MongoDB is a fast, flexible, scalable database designed to reduce administrative tasks through features like replica sets, sharding, and disaster recovery. It also includes powerful analysis tools and indexes that allow for queries on non-relational data structures.
DataStax: Making Cassandra Fail (for effective testing) (DataStax Academy)
This document discusses testing Cassandra applications by making Cassandra fail deterministically. It introduces Stubbed Cassandra, a tool that allows priming Cassandra to respond to queries and prepared statements with different failures like timeouts, unavailability, and coordinator issues. Tests can verify application behavior and retries under failures. Stubbed Cassandra runs as a separate process and exposes REST APIs to prime failures and verify query activity, allowing integration into testing frameworks. It aims to help test edge cases and fault tolerance more effectively than existing Cassandra testing tools.
Cassandra is great but how do I test my application? (Christopher Batey)
The document discusses testing Cassandra applications and introduces Stubbed Cassandra, a tool for testing Cassandra drivers. Stubbed Cassandra allows priming a fake Cassandra server to return expected responses or errors. It can be used to test failure scenarios, edge cases, and retry policies. The tool has Java and REST APIs and makes it possible to quickly test Cassandra applications in a deterministic and non-brittle way.
Indexes are references to documents that are efficiently ordered by key and maintained in a tree structure for fast lookup. They improve the speed of document retrieval, range scanning, ordering, and other operations by enabling the use of the index instead of a collection scan. While indexes improve query performance, they can slow down document inserts and updates since the indexes also need to be maintained. The query optimizer aims to select the best index for each query but can sometimes be overridden.
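The "ordered by key" property is what makes range scans and sorted results cheap, and it is also why every insert pays an index-maintenance cost. A minimal Python sketch (a sorted list standing in for a B-tree; the class and field names are invented):

```python
import bisect

class SimpleIndex:
    """A sorted list of (key, doc_id) pairs, standing in for a B-tree."""
    def __init__(self):
        self.entries = []

    def insert(self, key, doc_id):
        # Every write pays this maintenance cost on top of storing the doc.
        bisect.insort(self.entries, (key, doc_id))

    def range_scan(self, lo, hi):
        # Jump to `lo`, then walk forward: results come back already ordered.
        start = bisect.bisect_left(self.entries, (lo,))
        out = []
        for key, doc_id in self.entries[start:]:
            if key >= hi:
                break
            out.append(doc_id)
        return out

idx = SimpleIndex()
for doc_id, age in enumerate([35, 22, 47, 29, 41]):
    idx.insert(age, doc_id)

print(idx.range_scan(25, 42))  # doc ids for ages 29, 35, 41, in key order
```

Without the index, the same range query would have to examine all five documents and then sort the matches.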
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
Back to Basics 2017: Mi primera aplicación MongoDB (MongoDB)
Discover:
How to install MongoDB and use the MongoDB shell
Basic CRUD operations
How to analyze query performance and add an index
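The CRUD operations the webinar walks through can be mimicked with an in-memory Python sketch (the helpers below echo MongoDB's insert_one/find/update_one/delete_one verbs but are invented stand-ins, not driver calls; a real application would use PyMongo against a running server):

```python
# An in-memory stand-in for a collection, with MongoDB-flavoured CRUD verbs.
collection = []

def matches(doc, query):
    return all(doc.get(k) == v for k, v in query.items())

def insert_one(doc):
    collection.append(dict(doc))

def find(query):
    return [d for d in collection if matches(d, query)]

def update_one(query, new_values):
    for d in collection:
        if matches(d, query):
            d.update(new_values)  # like a {"$set": ...} update
            return 1
    return 0

def delete_one(query):
    for i, d in enumerate(collection):
        if matches(d, query):
            del collection[i]
            return 1
    return 0

insert_one({"name": "Ada", "city": "London"})
insert_one({"name": "Grace", "city": "Washington"})
update_one({"name": "Ada"}, {"city": "Cambridge"})
print(find({"name": "Ada"}))  # [{'name': 'Ada', 'city': 'Cambridge'}]
delete_one({"name": "Grace"})
print(len(collection))        # 1
```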
Cassandra Summit EU 2014 - Testing Cassandra Applications (Christopher Batey)
The document discusses testing Cassandra applications. It introduces Stubbed Cassandra, a tool that allows embedding a Cassandra server in unit tests to prime responses and verify queries. It can start multiple instances in a test, prime different result types, and verify connections and prepared statements. Example tests show how to test queries, errors, retries, and slow connections. Future features may include loading schemas and simulating multiple nodes.
This document provides an overview of performance tuning and optimization in MongoDB. It defines performance tuning as modifying a system to handle increased load, while optimization is modifying a system to work more efficiently or use fewer resources. Measurement tools discussed include log files, the profiler, query optimizer, and explain plans. Effecting change involves measuring current performance, identifying bottlenecks, removing bottlenecks, remeasuring, and repeating. Possible areas for improvement discussed are schema design, access patterns, indexing, hardware configuration, and instance configuration. The document provides examples and best practices around indexing, access patterns, and hardware tuning.
This document provides an overview and instructions for deploying, upgrading, and troubleshooting a MongoDB sharded cluster. It describes the components of a sharded cluster including shards, config servers, and mongos processes. It provides recommendations for initial deployment including using replica sets for shards and config servers, DNS names instead of IPs, and proper user authorization. The document also outlines best practices for upgrading between minor and major versions, including stopping the balancer, upgrading processes in rolling fashion, and handling incompatible changes when downgrading major versions.
MySQL Without The SQL -- Oh My! PHP Detroit July 2018 (Dave Stokes)
MySQL 8 can be used as a NoSQL JSON document store, and with the new X DevAPI you can stop embedding SQL strings in your PHP code. Plus, you can also access and JOIN SQL tables.
The document provides a roadmap and overview of recent releases of MongoDB including versions 1.8, 2.0, 2.2, and 2.4. It summarizes some of the key features introduced in each release such as journaling, indexing improvements, aggregation framework, sharding enhancements, and performance improvements. It then focuses on details of the 2.2 release including concurrency improvements, the new aggregation framework, TTL collections, tag aware sharding, and read preferences. It concludes by outlining some ongoing work and encouraging participation in the MongoDB community.
Elasticsearch 101 - Cluster setup and tuning (Petar Djekic)
Elasticsearch 101 provides an overview of setting up, configuring, and tuning an Elasticsearch cluster. It discusses hardware requirements including memory, avoiding high cardinality fields, indexing and querying data, and tooling. The document also covers potential issues like data loss during network partitions and exhausting available Java heap memory.
To understand how to make your application fast, it's important to understand what makes the database fast. We will take a detailed look at how to think about performance, and how different choices in schema design affect your cluster performances depending on storage engines used and physical resources available.
Webinar: Architecting Secure and Compliant Applications with MongoDB (MongoDB)
High-profile security breaches have become embarrassingly common, but ultimately avoidable. Now more than ever, database security is a critical component of any production application. In this talk you'll learn to secure your deployment in accordance with best practices and compliance regulations. We'll explore the MongoDB Enterprise features which ensure HIPAA and PCI compliance, and protect you against attack, data exposure and a damaged reputation.
MongoDB: How We Did It – Reanimating Identity at AOL (MongoDB)
AOL experienced explosive growth and needed a new database that was both flexible and easy to deploy with little effort. They chose MongoDB. Due to the complexity of internal systems and the data, most of the migration process was spent building a new identity platform and adapters for legacy apps to talk to MongoDB. Systems were migrated in 4 phases to ensure that users were not impacted during the switch. Turning on dual reads/writes to both legacy databases and MongoDB also helped get production traffic into MongoDB during the process. Ultimately, the project was successful with the help of MongoDB support. Today, the team has 15 shards, with 60-70 GB per shard.
High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah... (Lucidworks)
MapQuest developed a search ahead feature for their mobile app to enable auto-complete searching across their large dataset. They used Solr and implemented various techniques to optimize performance, including custom routing, analysis during ETL, and extensive JVM tuning. Their architecture included multiple Solr clusters with different configurations. Through testing and monitoring, they were able to meet their sub-140ms response time requirement for queries.
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE (DataStax Academy)
The document provides 5 tips for using Cassandra and DSE: 1) Data modeling best practices to avoid secondary indexes, 2) Understanding compaction choices like size-tiered, leveled, and date-tiered and their use cases, 3) Common mistakes in proofs-of-concept like testing on different hardware and empty nodes, 4) Hardware recommendations like using moderate sized nodes with SSDs, and 5) Anti-patterns like loading large batches of data and modeling queues with improper partitioning.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power marketing scenarios that are activated across multiple platforms and channels, such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.
- What are we storing?
- Multi Source – Multi Channel Problem
- Data Representation and Nested Schema Evolution
- Performance Trade Offs with Various Formats
- Go over anti-patterns used (String FTW)
- Data Manipulation using UDFs
- Writer Worries and How to Wipe them Away
- Staging Tables FTW
- Datalake Replication Lag Tracking
- Performance Time!
MongoDB Operational Best Practices (mongosf2012) (Scott Hernandez)
The document outlines operational best practices learned from analyzing real support cases. It describes 3 scenarios where performance issues were identified: 1) response time timeouts due to disk monitoring and instrumentation issues, 2) high CPU usage due to poorly indexed queries, and 3) general slowdowns due to large disk read-ahead size. Key learnings include monitoring logs and systems, performance testing before deployments, using database profilers and indexes, and planning rollouts and configurations.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true of less widespread, but often very useful, NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
Cassandra is used for real-time bidding in online advertising. It processes billions of bid requests per day with low latency requirements. Segment data, which assigns product or service affinity to user groups, is stored in Cassandra to reduce calculations and allow users to be bid on sooner. Tuning the cache size and understanding the active dataset helps optimize performance.
Using Cassandra as a distributed logging system to store PB-scale data, Ramesh Veeramani
This document discusses using Cassandra for big data event logging. It notes that Cassandra scales incrementally, is highly available, and is well suited for OLTP workloads where write throughput is prioritized over reads. It covers Cassandra's internal workings including token assignment, replication, and compaction strategies. Setup instructions are provided along with benchmarking results. Maintenance tools like Nodetool and stress testing tools are also mentioned. The document concludes that Cassandra is a good candidate for logging systems due to its scalability and ease of adding nodes.
The document discusses various topics related to using MongoDB including schema design, indexing, concurrency, and durability. For schema design, it recommends using small document sizes and separating documents that grow unbounded into multiple collections. For indexing, it emphasizes ensuring queries use indexes and introduces sparse indexes and index-only queries. It notes concurrency is coarse-grained currently but being improved. For durability, it discusses storage, journaling, replication, and write concerns.
Tomas Doran presented on TIM Group's implementation of Logstash, which processes over 55 million messages per day. Their applications are all Java/Scala/Clojure, and they developed their own library to send structured log events as JSON to Logstash over ZeroMQ for reliability. They index the data in Elasticsearch and use it for metrics, alerts, and dashboards, but face challenges with data growth.
At Databricks, we manage Spark clusters for customers to run various production workloads. In this talk, we share our experiences in building a real-time monitoring system for thousands of Spark nodes, including the lessons we learned and the value we’ve seen from our efforts so far.
This was part of a talk presented at the #monitorSF Meetup held at Databricks HQ in SF.
Amazon Aurora adds PostgreSQL compatibility to its cloud-optimized relational database. With PostgreSQL compatibility, customers can now choose to use Amazon's database with the performance and availability of commercial databases and the simplicity and cost-effectiveness of open source databases. Amazon Aurora provides high performance, durability, availability and automatic scaling capabilities for PostgreSQL workloads.
1) Durability, the guarantee that written data will remain readable indefinitely, is a challenging property to ensure, especially in distributed systems.
2) Many factors can undermine durability, including hardware and software failures, performance optimizations, file system and storage vulnerabilities, and human errors like server re-imaging.
3) Distributed streaming systems that replicate data across multiple servers using logs as the source of truth offer a promising approach for achieving durability at scale, as exemplified by Apache BookKeeper which guarantees data remains readable if acknowledged or previously read.
In the world of big data we need to build services that can collect massive amounts of data, store it, and pass it on for processing and analysis. However, building manageable, reliable services that are scalable and cost-effective is not an easy task. The choice of ecosystem, frameworks, and programming language, as well as applying solid engineering principles, is crucial to achieving this goal.
I will share our journey and insights from rebuilding a cloud service in the Linux ecosystem using Scala, Akka Actors, and Aerospike DB, at the end of which we gained a tenfold improvement in server utilization with a much lighter, more stable, and more reliable system that handles tens of millions of requests per hour.
Managing Security at 1M Events a Second Using Elasticsearch, Joe Alex
The document discusses managing security events at scale using Elasticsearch. Some key points:
- The author manages security logs for customers, collecting, correlating, storing, indexing, analyzing, and monitoring over 1 million events per second.
- Before Elasticsearch, traditional databases couldn't scale to billions of logs, searches took days, and advanced analytics weren't possible. Elasticsearch allows customers to access and search logs in real-time and perform analytics.
- Their largest Elasticsearch cluster has 128 nodes indexing over 20 billion documents per day totaling 800 billion documents. They use Hadoop for long term storage and Spark and Kafka for real-time analytics.
The document provides an overview of key concepts in system design including:
1) Breaking problems into modules using a top-down approach and discussing trade-offs.
2) Architectural components like load balancing, databases, caching, and data partitioning that are important to consider in system design.
3) Database types like SQL and NoSQL and when each is best suited based on factors like data structure, scalability needs, and development agility.
Learn about structured logging with rsyslog and how it can be used to do actual format conversions. Includes config samples for Linux and Windows log sources.
These are the slides from my talk at Hulu in March 2015 discussing Apache Spark & Cassandra. I cover the evolution of data from a single machine to RDBMS (MySQL is the primary example) to big data systems.
On the Spark side, I covered batch jobs, streaming, Apache Kafka, an introduction to machine learning, clustering, logistic regression and recommendations systems (collaborative filtering).
The talk was recorded and is available on youtube: https://www.youtube.com/watch?v=_gFgU3phogQ
A presentation about the deployment of an ELK stack at bol.com
At bol.com we use Elasticsearch, Logstash, and Kibana in a log-search system that allows our developers and operations people to easily access and search through log events coming from all layers of its infrastructure.
The presentation explains the initial design and its failures, then describes the latest design (mid-2014) and its improvements. Finally, it gives a set of tips on scaling Logstash and Elasticsearch.
These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the Utrecht bol.com HQ.
17. § Flexible schema
§ 4th major iteration over 10 years
§ Schema still adjusted relatively often (6 mo)
18. § API stats:
§ 300M API requests / peak day
§ 300K API requests / peak minute
§ 150M API requests / off-peak day
§ DB stats:
§ 1.5B reads / peak day
§ 58K reads / sec (peak)
§ 10M writes / peak day
19. § DB stats:
§ 20TB of data (without 3x replication)
§ 7.5TB of that is canonical
§ 12.5TB is derivative, denormalized for queries
§ DB size:
§ 60TB of disk used (replication factor = 3)
§ Able to drop most derivative data in emergency
21. § Events are CANONICAL
§ Multiple, derivative views
§ View computed from events
§ Views can be deleted
(recomputed from events)
§ Views stored in DB
§ For faster reads
§ Event Sourcing
https://martinfowler.com/eaaDev/EventSourcing.html
Events View
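The event-sourcing pattern on this slide — canonical events, disposable derived views — can be sketched in a few lines. This is a hypothetical toy (the event shapes are ours, not the Family Tree schema): the view is a pure fold over the event log, so it can be deleted and rebuilt at any time.

```python
# Canonical event log; views below are derived and therefore disposable.
events = [
    {"type": "person_created", "id": "p1", "name": "Ada"},
    {"type": "name_changed",   "id": "p1", "name": "Ada Lovelace"},
    {"type": "person_created", "id": "p2", "name": "Alan"},
]

def rebuild_view(event_log):
    """Fold the canonical events into a read-optimized view."""
    view = {}
    for e in event_log:
        if e["type"] == "person_created":
            view[e["id"]] = {"name": e["name"], "versions": 1}
        elif e["type"] == "name_changed":
            person = view[e["id"]]
            person["name"] = e["name"]
            person["versions"] += 1
    return view

view = rebuild_view(events)   # drop the view table, rerun this, same result
print(view["p1"]["name"])     # Ada Lovelace
```

Because the view is just a deterministic function of the log, storing it in the DB is purely a read-speed optimization, which is exactly what lets the deck treat views as deletable.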
22. § Views optimized for Read
§ 100 reads : 1 write
§ Different use case?
§ Might justify a new view
§ Might just change views
§ Family Tree views
§ Person Card (summary)
§ Full Person View
§ Change History
Events View
38. § Solution #1:
§ Reduce size of views
§ Family Tree data limits (control) & data cleanup (fix)
§ Emergency blacklist for certain records
until they can be manually trimmed
§ Solution #2:
§ Throttle duplicate requests
§ Throttle problem clients
§ Reduce rate of requests to specific replica set
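Solution #2's duplicate-request throttle might look roughly like the sketch below. Everything here (the window size, key shape, and class name) is an assumption for illustration, not the production implementation:

```python
import time

class DuplicateThrottle:
    """Reject identical requests from the same client seen within a
    short window; a sketch of 'throttle duplicate requests'."""
    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_seen = {}   # (client_id, request_key) -> last admit time

    def allow(self, client_id, request_key):
        key = (client_id, request_key)
        now = self.clock()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.window:
            return False      # duplicate inside the window: reject
        self.last_seen[key] = now
        return True

t = DuplicateThrottle(window_seconds=5.0)
print(t.allow("client-a", "GET /person/p1"))  # True
print(t.allow("client-a", "GET /person/p1"))  # False (duplicate)
```

A per-client rate limit ("throttle problem clients") follows the same shape with a counter or token bucket instead of a last-seen timestamp.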
39. § Solution #3:
§ Spread views by prepending key prefix
§ Events on different set of nodes than views
§ Put each type of view on different set of nodes
§ Spread traffic out
§ Solution #4:
§ Prevent merge / edit wars (limits)
§ Emergency lock records / suspend accounts
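Solution #3's key-prefix trick can be illustrated with a toy partitioner. The node ring, hash choice, and key format below are all assumptions; the point is only that prepending the data kind changes where a key hashes, so events and each view type tend to land on different sets of nodes:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # toy ring, not real topology

def storage_key(kind, record_id):
    """Prepend the data kind ('event', 'person-card', ...) so each kind
    of row for the same record hashes to a different partition."""
    return f"{kind}:{record_id}"

def owner(key):
    """Toy partitioner: hash the full (prefixed) key to pick a node."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

for kind in ("event", "person-card", "full-person"):
    print(kind, "->", owner(storage_key(kind, "p1")))
```

The prefix is deterministic, so readers can always reconstruct the key; the traffic for one hot record is nonetheless spread across the ring instead of hammering one replica set.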
40. § Solution #5:
§ Split command up into contiguous events
§ Avoid over-copying large transactions
§ Split batches when writing
§ Retry writes if writes time out (janitor & queue)
§ Solution #6:
§ Change view tables to LCS (leveled compaction)
§ Lower gc_grace_seconds for view tables to 2d
§ Emergency: Truncate view tables
45. § Concurrent writes:
§ Concurrent incremental view refresh
§ Different view snapshots read (different nodes)
§ Overlapping view writes
§ Missing view data (as if write never happened)
§ Partial writes:
§ Timeout on complex many-record write
§ Janitor not yet caught up replaying write
§ User refreshes and attempts again
46. § Solution #1:
§ Observe view UUID during event preparation
§ Observe view UUID during write
§ Revert if different (concurrent write conflict)
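Solution #1 is classic optimistic concurrency control. A minimal sketch follows; the class and field names are ours, and a real implementation would enforce the check with a conditional write in the database rather than in application memory:

```python
import uuid

class StoredView:
    """A derived view row carrying a version UUID."""
    def __init__(self, data):
        self.data = data
        self.version = uuid.uuid4()

def write_view(view, observed_version, new_data):
    """Apply the write only if the view is unchanged since we observed
    it during event preparation; otherwise signal a revert/retry."""
    if view.version != observed_version:
        return False                   # concurrent write won: revert
    view.data = new_data
    view.version = uuid.uuid4()        # fresh version for the next writer
    return True

v = StoredView({"name": "Ada"})
seen = v.version                       # observed during event preparation
print(write_view(v, seen, {"name": "Ada Lovelace"}))  # True
print(write_view(v, seen, {"name": "Alan"}))          # False (stale)
```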
§ Solution #2:
§ Spark job to find inconsistencies
§ Semi-automated categorization & fixup
§ Address each source of inconsistency
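Solution #2's consistency check can be sketched locally: recompute each record's view from its canonical events and diff against what is stored. The real job runs in Spark over the full dataset; the projection and data below are toy assumptions:

```python
def find_inconsistencies(events_by_record, stored_views, rebuild):
    """Recompute each record's view from its canonical events and
    report records whose stored view has drifted."""
    drifted = []
    for rec_id, evts in events_by_record.items():
        if stored_views.get(rec_id) != rebuild(evts):
            drifted.append(rec_id)
    return drifted

rebuild = lambda evts: {"name": evts[-1]["name"]}        # toy projection
events_by_record = {"p1": [{"name": "Ada"}, {"name": "Ada Lovelace"}],
                    "p2": [{"name": "Alan"}]}
stored_views = {"p1": {"name": "Ada Lovelace"},
                "p2": {"name": "Alan T."}}               # p2 has drifted
print(find_inconsistencies(events_by_record, stored_views, rebuild))  # ['p2']
```

Because events are canonical, a drifted view found this way can simply be deleted and recomputed, which is what makes the semi-automated fixup on this slide safe.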
47. § Fantastic peak day performance
§ Data consistency is good enough
§ Consistency checks catching issues
§ Quality of Family Tree improved with cleanups
§ Splitting events / view – lots of flexibility
§ Flexible schema – allows for agility
§ Takes abuse from users and keeps running
50. § Event Sourced data model:
§ Very performant & scalable
§ Good enough consistency
§ NTP time:
§ Must monitor / alert
§ Must deal with small offsets
§ Consistency checks:
§ Long-term consistency must be measured
§ Fixes for measured issues must be applied