Tips, tricks, and gotchas learned at Localytics for optimizing MongoDB installs. Includes information about document design, indexes, fragmentation, migration, AWS EC2/EBS, and more.
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase – HBaseCon
In this presentation, we will introduce Hotspot's Garbage First collector (G1GC) as the most suitable collector for latency-sensitive applications running in large-memory environments. We will first discuss G1GC internal operations and tuning opportunities, and also cover tuning flags that set desired GC pause targets, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several HBase case studies using Java heaps as large as 100GB that show how to best tune applications to remove unpredictable, protracted GC pauses.
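As a concrete starting point (an illustration, not taken from the slides): G1 is selected and given a pause target with flags such as `-XX:+UseG1GC`, `-XX:MaxGCPauseMillis`, and `-XX:InitiatingHeapOccupancyPercent`, and a small stdlib probe can confirm which collectors are active and how much cumulative pause time they accrue:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseProbe {
    // Summarize cumulative collection counts and pause time per collector.
    public static String summarize() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": ").append(gc.getCollectionCount()).append(" collections, ")
              .append(gc.getCollectionTime()).append(" ms total\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example launch (flag values are hypothetical starting points):
        // java -XX:+UseG1GC -Xms100g -Xmx100g \
        //      -XX:MaxGCPauseMillis=100 -XX:InitiatingHeapOccupancyPercent=35 GcPauseProbe
        System.out.print(summarize());
    }
}
```

Running it under different flag combinations gives a quick, if coarse, comparison of cumulative pause time per collector.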
This presentation briefly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records – ScyllaDB
In this talk, we will discuss Happn's war story about migrating a Cassandra 2.1 cluster containing more than 68 billion records in a counter table to ScyllaDB Open Source.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond – ScyllaDB
Beyond the immediate schema changes supported in Scylla Open Source 5.0, learn how the Raft consensus infrastructure will enable radical new capabilities. Discover how it will bring more dynamic topology changes, tablets, immediate consistency, better and faster elasticity, and simpler repair operations.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB), so our performance and reliability depend heavily on RocksDB. Beyond MyRocks, we also have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, cover key differences from InnoDB, and share a few interesting lessons learned from production.
Producer Performance Tuning for Apache Kafka – Jiangjie Qin
Kafka is well known for high-throughput ingestion. However, to get the best latency characteristics without compromising throughput and durability, we need to tune Kafka. In this talk, we share our experiences in achieving the optimal combination of latency, throughput, and durability for different scenarios.
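By way of illustration, that trade-off usually comes down to a handful of standard producer configs (`linger.ms`, `batch.size`, `acks`, `compression.type`, `enable.idempotence`); the values below are hypothetical starting points, not recommendations from the talk:

```java
import java.util.Properties;

public class ProducerTuning {
    // Three hedged starting points for the latency/throughput/durability trade-off.
    public static Properties lowLatency() {
        Properties p = new Properties();
        p.setProperty("linger.ms", "0");        // send immediately, no batching delay
        p.setProperty("acks", "1");             // leader-only ack keeps latency low
        p.setProperty("compression.type", "none");
        return p;
    }

    public static Properties highThroughput() {
        Properties p = new Properties();
        p.setProperty("linger.ms", "25");       // wait briefly to fill larger batches
        p.setProperty("batch.size", "262144");  // 256 KiB batches amortize request overhead
        p.setProperty("compression.type", "lz4");
        return p;
    }

    public static Properties highDurability() {
        Properties p = new Properties();
        p.setProperty("acks", "all");                // wait for all in-sync replicas
        p.setProperty("enable.idempotence", "true"); // avoid duplicates on retry
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return p;
    }
}
```

Each `Properties` object would be passed to a `KafkaProducer` constructor; plain `Properties` is used here so the sketch stands alone without the Kafka client library.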
MySQL Scalability and Reliability for Replicated Environment – Jean-François Gagné
You have a working application that is using MySQL: great! At the beginning, you are probably using a single database instance, and maybe – but not necessarily – you have replication for backups, but you are not reading from slaves yet. Scalability and reliability were not the main focus in the past, but they are starting to be a concern. Soon, you will have many databases and you will have to deal with replication lag. This talk will present how to tackle the transition.
We mostly cover standard/asynchronous replication, but we will also touch on Galera and Group Replication. We present how to adapt the application to become replication-friendly, which facilitates reading from and failing over to slaves. We also present solutions for managing read views at scale and enabling read-your-own-writes on slaves, and we touch on vertical and horizontal sharding for when deploying bigger servers is no longer possible.
Are UNIQUE and FOREIGN KEYs still possible at scale? What are the downsides of AUTO_INCREMENTs? How can we avoid overloading replication? What are the limits of archiving? … Come to this talk to get answers and to leave with tools for tackling the challenges of the future.
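The read-your-own-writes idea mentioned above can be sketched with a hypothetical router that remembers each session's last write position and only sends its reads to a slave that has caught up (in MySQL this position would typically be a GTID set, waited on with `WAIT_FOR_EXECUTED_GTID_SET`):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReadYourWritesRouter {
    // Hypothetical sketch: per-session high-water mark of the last write position,
    // compared against a replica's applied position (e.g. a GTID sequence number).
    private final Map<String, Long> lastWriteBySession = new ConcurrentHashMap<>();

    public void recordWrite(String sessionId, long masterPosition) {
        lastWriteBySession.merge(sessionId, masterPosition, Math::max);
    }

    // Route a read to the replica only when it has applied the session's last write.
    public String chooseTarget(String sessionId, long replicaAppliedPosition) {
        long needed = lastWriteBySession.getOrDefault(sessionId, 0L);
        return replicaAppliedPosition >= needed ? "replica" : "master";
    }
}
```

A session that has never written can always read from a replica; one that just wrote falls back to the master until the replica catches up.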
Delta from a Data Engineer's Perspective – Databricks
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end-to-end Big Data solutions.
Kafka Streams State Stores Being Persistent – confluent
Being Persistent: A Look Into Kafka Streams State Stores, Neil Buesing, Principal Solutions Architect, Rill Data
Meetup link: https://www.meetup.com/TwinCities-Apache-Kafka/events/284002062/
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s... – Flink Forward
Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and we have observed that the scale of state that is managed by Flink in production is constantly growing. This development created new challenges for state management in Flink, in particular for state checkpointing, which is the core of Flink's fault tolerance mechanism. Two of the most important problems that we had to solve were the following: (i) how can we limit the duration and size of checkpoints to something that does not grow linearly in the size of the state, and (ii) how can we take checkpoints without blocking the processing pipeline in the meantime? We have implemented incremental checkpoints to solve the first problem by checkpointing only the changes between checkpoints, instead of always recording the whole state. Asynchronous checkpoints address the second problem and enable Flink to continue processing concurrently with running checkpoints. In this talk, we will take a deep dive into the details of Flink's new checkpointing features. In particular, we will talk about the underlying data structures, log-structured merge trees and copy-on-write hash tables, and how those building blocks are assembled and orchestrated to advance Flink's checkpointing.
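The incremental-checkpoint idea can be reduced to a toy dirty-set sketch (a deliberate simplification: Flink actually tracks changes at the level of RocksDB SST files, not individual keys):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IncrementalCheckpointSketch {
    // Illustrative only: full state plus the set of keys changed since the last checkpoint.
    private final Map<String, Long> state = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();

    public void update(String key, long value) {
        state.put(key, value);
        dirty.add(key); // remember the delta for the next checkpoint
    }

    // An incremental checkpoint records only the changed entries, not the whole state,
    // so its size tracks the rate of change rather than the total state size.
    public Map<String, Long> checkpoint() {
        Map<String, Long> delta = new HashMap<>();
        for (String key : dirty) {
            delta.put(key, state.get(key));
        }
        dirty.clear();
        return delta;
    }
}
```

Recovery would replay the chain of deltas (or a base snapshot plus deltas), which is the price paid for the much cheaper steady-state checkpoints.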
From cache to in-memory data grid. Introduction to Hazelcast. – Taras Matyashovsky
This presentation:
* covers the basics of caching and popular cache types
* explains the evolution from a simple cache to a distributed cache, and from distributed caches to an IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or as a promotion of Hazelcast as the best solution
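The "simple cache" starting point of that evolution fits in a few lines of JDK code (illustrative only, no Hazelcast involved): an in-process LRU cache built on `LinkedHashMap`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // access-order mode: reads refresh an entry's recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least-recently-used entry on overflow
    }
}
```

Everything beyond this, such as replication, partitioning, and near-caches, is what distributed caches and IMDGs add on top.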
Cruise Control: Effortless management of Kafka clusters – Prateek Maheshwari
Kafka has become the de facto standard for streaming data with high throughput, low latency, and fault tolerance. However, its rising adoption raises new challenges. In particular, the growing cluster sizes, increasing volume and diversity of user traffic, and aging network and server components induce an overhead in managing the system. This overhead makes it infeasible for human operators to constantly monitor, identify, and mitigate issues. The resulting utilization imbalance across brokers leads to unpredictable client performance due to the high variation in their throughput and latency. Finally, properly expanding, shrinking, or upgrading clusters also incurs a management overhead. Hence, adopting a principled approach to manage Kafka clusters is integral to the sustainability of the infrastructure.
This talk will describe how LinkedIn alleviates the management overhead of large-scale Kafka clusters using Cruise Control. To this end, first, we will discuss the reactive and proactive techniques that Cruise Control uses to support admin operations for cluster maintenance, enable anomaly detection with self-healing, and provide real-time monitoring for Kafka clusters. Next, we will examine how Cruise Control performs in production. Finally, we will conclude with questions and further discussion.
Kafka streams windowing behind the curtain – confluent
Kafka Streams Windowing Behind the Curtain, Neil Buesing, Principal Solutions Architect, Rill
https://www.meetup.com/TwinCities-Apache-Kafka/events/279316299/
MySQL Parallel Replication: inventory, use-case and limitations – Jean-François Gagné
In the last 24 months, MySQL/MariaDB replication speed has improved a lot thanks to parallel replication. MySQL and MariaDB have different types of parallel replication; in this talk, I present the different implementations, with their limitations and the corresponding tuning parameters. I cover what to do to make parallel replication faster and what to avoid for maximizing parallel replication benefits. I also present benchmark results from real Booking.com workloads. Finally, I discuss some deployments at Booking.com that take advantage of parallel replication speed improvements.
An overview of the wide spectrum of data consistency models.
Created thanks to many smart and generous people who shared their knowledge of and insights into the subject. Thank you all!
I hope this can help you scratch the surface of the subject of consistency models too.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook – The Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB, and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly multi-core processors and RAM-speed storage devices.
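The log-structured merge design at RocksDB's core can be sketched in miniature (illustrative only; a real LSM tree adds a write-ahead log, bloom filters, and background compaction):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

public class TinyLsm {
    // Illustrative LSM skeleton: a sorted in-memory memtable, flushed to
    // immutable sorted runs when it grows past a threshold. Reads check the
    // memtable first, then runs from newest to oldest.
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<TreeMap<String, String>> runs = new ArrayDeque<>();
    private final int flushThreshold;

    public TinyLsm(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            runs.addFirst(memtable); // newest run first
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (TreeMap<String, String> run : runs) {
            if (run.containsKey(key)) return run.get(key); // newest value wins
        }
        return null;
    }
}
```

Writes are always sequential appends to the newest structure, which is why LSM stores favor write-heavy, flash-friendly workloads.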
Choosing the right Professional Employer Organization (PEO) will help your business remain in compliance, leverage the efficiencies of great technology, and gain access to comprehensive capabilities that benefit your business and its employees. So what are the key things you need to know about PEOs?
The Top Six Early Detection and Action Must-Haves for Improving Outcomes – Health Catalyst
Given the industry’s shift toward value-based, outcomes-based healthcare, organizations are working to improve outcomes. One of their top outcomes improvement priorities should be early detection and action, which can significantly improve clinical, financial, and patient experience outcomes. Through early detection and action, systems embrace a proactive approach to healthcare that aims to prevent illness; the earlier a condition is detected, the better the outcome.
But, as with most things in healthcare, improving early detection is easier said than done. This executive report provides helpful, actionable guidance about overcoming common barriers (logistical, cultural, and technical) and improving early detection and action by integrating six must-haves:
Multidisciplinary teams
Analytics
Leadership-driven culture change
Creative customization
Proof-of-concept pilot projects
In this presentation, Akka Team Lead and author Roland Kuhn presents the freshly released final specification for Reactive Streams on the JVM. This work was done in collaboration with engineers representing Netflix, Red Hat, Pivotal, Oracle, Typesafe and others to define a standard for passing streams of data between threads in an asynchronous and non-blocking fashion. This is a common need in Reactive systems, which must handle streams of "live" data whose volume is not predetermined.
The most prominent issue facing the industry today is that resource consumption needs to be controlled such that a fast data source does not overwhelm the stream destination. Asynchrony is needed in order to enable the parallel use of computing resources, on collaborating network hosts or multiple CPU cores within a single machine.
Here we'll review the mechanisms employed by Reactive Streams, discuss the applicability of this technology to a variety of problems encountered in day to day work on the JVM, and give an overview of the tooling ecosystem that is emerging around this young standard.
In celebration of International Women's Day, we dug into some of our most interesting interviews with women in marketing and have put together the following slideshow highlighting some words of wisdom. Happy Women's Day!
Leading Adaptive Change to Create Value in Healthcare – Health Catalyst
In pursuit of the Triple Aim, healthcare leaders work hard to improve care, reduce costs, and improve the patient experience. But accomplishing these goals requires an engaged staff that makes progress, day in and day out. Adaptive Leadership (AL) principles help leaders understand human behavior to mobilize change and overcome work avoidance, which happens when staff operate above or below the productive zone of tension.
By understanding what adaptive work actually is (and that adaptive problems can’t be solved with technical fixes), and why work avoidance happens (because people are overwhelmed; the heat is too high), leaders can keep their teams engaged by using influence and leadership—not authority—to “lower the heat” on their people:
Validate the difficulty of the situation.
Simplify/clarify the work.
Provide additional resources (time, training, etc.)
Dr. Ulstad has worked with healthcare leaders and teams for the last 20 years to help them understand behaviors triggered by rapid, high-volume change, and apply AL principles to guide the changes critical to their organizations’ success.
How To Avoid The 3 Most Common Healthcare Analytics Pitfalls And Related Inef... – Health Catalyst
Analytics are supposed to provide data-driven solutions, not additional healthcare analytics pitfalls and other related inefficiencies. Yet such issues are quite common. Becoming familiar with potential problems will help health systems avoid them in the future. The three common analytics pitfalls are point solutions, EHRs, and independent data marts located in many different databases. An EDW will counter all three of these problems. The two inefficiencies include report factories and flavor-of-the-month projects. The solution that best overcomes these inefficiencies is a robust deployment system.
From Installed to Stalled: Why Sustaining Outcomes Improvement Requires More ... – Health Catalyst
The big first step toward building an outcomes improvement program is installing the analytics platform. But it’s certainly not the only step. Sustaining healthcare outcomes improvement is a triathlon, and the three legs are:
Installing an analytics platform
Gaining adoption
Implementing best practices
The program requires buy-in, enthusiasm, even evangelizing of analytics and its tools throughout the organization. It also requires that learnings from analysis translate into best practices; otherwise the program fails to produce results and will eventually fade away. Equally important is that top-level leadership across the organization, not just IT, supports and promotes the program on an ongoing basis. We explore each of these elements and how they come together to create the successful and sustainable outcomes improvement that defines leading healthcare organizations.
6 Proven Strategies for Engaging Physicians—and 4 Ways to Fail – Health Catalyst
For healthcare organizations to be successful with their quality and cost improvement initiatives, physicians must be engaged with the proposed changes. But many physicians are not engaged because their morale is suffering. While some strategies to encourage buy-in for improvement initiatives don’t work, there are six strategies that have proven to be effective: (1) discover a common purpose, (2) adopt an engaging style, (3) turn physicians into partners, not customers, (4) segment the engagement plan, (5) use “engaging” improvement methods, and (6) provide them with backup—all the way to the board. Once the organization has their trust, physicians will gain enthusiasm to move forward with improvement efforts that will benefit everyone.
The 3 Must-Have Qualities of a Care Management System – Health Catalyst
Care management systems are defined in many ways, but the only effective system comprises three qualities:
1.) It’s comprehensive and includes a suite of tools to address all five core competencies of care management.
2.) It’s inclusive of all EMRs and other data sources to enable thorough communication and analysis.
3.) It’s analytics-driven design facilitates clinical decision making and workflow.
Ultimately, an effective system improves outcomes and becomes an indispensable tool for managing population health.
This article describes what drives successful care management, and reveals a suite of applications that aid care team members and patients through advanced algorithms and embedded analytics. Learn how technology is helping to develop appropriate interventions and improve clinical and financial outcomes.
How to Sustain Healthcare Quality Improvement in 3 Critical Steps – Health Catalyst
Many healthcare organizations don’t hold quality and cost gains because they don’t make improvement the backbone of their organization. Rather, they approach improvement as a series of initiatives. Ronald D. Snee, a fellow with the American Society for Quality states, “Many organizations focus on sustaining the gains only after improvement has been achieved. Intuitively, that may seem the correct sequence, but it is in fact backwards. The time to focus on sustaining improvement gains is well before the initiative is launched.”
Here are 3 critical organizational steps that can help sustain those gains.
Patient Flight Path Analytics: From Airline Operations to Healthcare Outcomes – Health Catalyst
We developed a predictive analytics framework for patient care based upon concepts from airline operations. Using the idea of an aircraft turnaround time where the airline wants to put the aircraft back into operation as soon as possible, we’ve created a way to help patients headed toward poor outcomes, along with their providers, “turnaround” and get the best possible, most cost-effective outcome. For example, in a diabetes patient, we might use variables such as: age, alcohol use, annual eye/foot exam, BMI, etc. to look for patterns that might influence two outcomes: 1) Diabetic control and 2) The absence of progression toward diabetic complications. The notion of our Patient Flight Path is useful at both the conceptual level, as well as the predictive algorithm implementation level.
Database vs Data Warehouse: A Comparative Review – Health Catalyst
What are the differences between a database and a data warehouse? A database is any collection of data organized for storage, accessibility, and retrieval. A data warehouse is a type of database that integrates copies of transaction data from disparate source systems and provisions them for analytical use. The important distinction is that data warehouses are designed to handle the analytics required for improving quality and costs in the new healthcare environment. A transactional database, like an EHR, doesn't lend itself to analytics.
Quality Improvement In Healthcare: Where Is The Best Place To Start? – Health Catalyst
One of the biggest challenges providers face in their quality improvement efforts is knowing where to get started. In my experience, one of the best ways to overcome that "where do we begin?" factor is to use data from an enterprise data warehouse to look for high-cost areas where there are large variations in how health care is delivered (a key process analysis, or KPA). Variation found through the KPA is an indicator of opportunity. The more avoidable variation that is reflected in a particular care process, the more opportunity there is to reduce that variation and standardize the process. Suppose after performing a KPA you discover three areas of opportunity. How do you determine which one to pursue, especially if it's your first journey into process improvement? The most obvious answer would seem to be the one with the largest potential ROI. That may not always be the best course to pursue, however. You will also want to take into consideration the readiness/openness to change in each of those areas.
MongoDB's architecture features built-in support for horizontal scalability, and high availability through replica sets. Auto-sharding allows users to easily distribute data across many nodes. Replica sets enable automatic failover and recovery of database nodes within or across data centers. This session will provide an introduction to scaling with MongoDB by one of MongoDB's early adopters.
MongoDB: Optimising for Performance, Scale & Analytics – Server Density
MongoDB is easy to download and run locally but requires some thought and further understanding when deploying to production. At scale, schema design, indexes and query patterns really matter. So does data structure on disk, sharding, replication and data centre awareness. This talk will examine these factors in the context of analytics, and more generally, to help you optimise MongoDB for any scale.
Presented at MongoDB Days London 2013 by David Mytton.
What if you could get blazing fast queries on your data without having to be on call for a giant, expensive database? By picking the right file format for your data, you can store your data on disk in the cloud and still get the performance you need for modern analytics. We'll discuss benchmarks of four different data storage formats: Parquet, ORC, Avro, and traditional character-separated files like CSV. We'll cover what they are, how they work at a bits-and-bytes level, and why you might choose each one for your use case.
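At the bits-and-bytes level the row/columnar distinction is just layout; the toy sketch below (not any real format's encoding) shows how a columnar layout groups a column's values together, which is what makes Parquet- and ORC-style encoding and compression so effective, while row formats like CSV and Avro keep each record contiguous:

```java
import java.util.ArrayList;
import java.util.List;

public class RowVsColumn {
    // Row layout: record 0 in full, then record 1, and so on (CSV/Avro style).
    public static List<String> rowLayout(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            for (String cell : row) out.add(cell);
        }
        return out;
    }

    // Column layout: all of column 0, then all of column 1 (Parquet/ORC style).
    // Like values end up adjacent, so run-length and dictionary encoding bite hard.
    public static List<String> columnLayout(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (int col = 0; col < rows[0].length; col++) {
            for (String[] row : rows) out.add(row[col]);
        }
        return out;
    }
}
```

A query that reads one column out of fifty touches a tiny contiguous slice of a columnar file, but must scan every record of a row-oriented one.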
Elasticsearch Architecture & What's New in Version 5 – Burak TUNGUT
General architectural concepts of Elasticsearch and what's new in version 5. The examples were prepared with our company's business data, so they are excluded from the presentation.
What Every Developer Should Know About Database Scalabilityjbellis
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...Glenn K. Lockwood
Comparing the burst buffers of today, such as the Cray DataWarp-based burst buffer implemented on NERSC Cori, to the proto-burst buffer deployed on SDSC's Gordon supercomputer in 2012.
Leveraging Databricks for Spark PipelinesRose Toomey
How Coatue Management saved time and money by moving Spark pipelines to Databricks.
Talk given at AWS + Databricks ML Dev Day workshop in NYC on 27 February 2020.
Leveraging Databricks for Spark pipelinesRose Toomey
How Coatue Management saved time and money by moving Spark pipelines to Databricks.
Talk given at AWS + Databricks ML Dev Day workshop in NYC on 27 February 2020.
Similar to Optimizing MongoDB: Lessons Learned at Localytics (20)
2. Me
• Email: my first name @ localytics.com
• twitter.com/andrew311
• andrewrollins.com
• Founder, Chief Software Architect at Localytics
3. Localytics
• Real time analytics for mobile applications
• Built on:
– Scala
– MongoDB
– Amazon Web Services
– Ruby on Rails
– and more…
4. Why I'm here: brain dump!
• To share tips, tricks, and gotchas about:
– Documents
– Indexes
– Fragmentation
– Migrations
– Hardware
– MongoDB on AWS
• Basic to more advanced, a complement to MongoDB Perf Tuning at MongoSF 2011
5. MongoDB at Localytics
• Use cases:
– Anonymous loyalty information
– De-duplication of incoming data
• Requirements:
– High throughput
– Add capacity without long down-time
• Scale today:
– Over 1 billion events tracked in May
– Thousands of MongoDB operations a second
6. Why MongoDB?
• Stability
• Community
• Support
• Drivers
• Ease of use
• Feature rich
• Scale out
9. Use BinData for UUIDs/hashes
Bad:
{
u: "21EC2020-3AEA-1069-A2DD-08002B30309D",
// 36 bytes plus field overhead
}
Good:
{
u: BinData(0, "…"),
// 16 bytes plus field overhead
}
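A quick way to see the saving, sketched in plain Python (the UUID value is the one from the slide; with a driver such as PyMongo you would wrap the raw bytes in a BSON binary type rather than store the hex string):

```python
import uuid

# A UUID stored as its hex string costs 36 bytes; the same value
# stored as raw binary (BinData subtype 0) costs 16 bytes.
u = uuid.UUID("21EC2020-3AEA-1069-A2DD-08002B30309D")

as_string = str(u)    # what the "bad" document stores
as_binary = u.bytes   # what BinData(0, ...) stores

print(len(as_string))  # 36
print(len(as_binary))  # 16
```

The saving compounds: it applies to every document *and* to every index entry built on the field.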
10. Override _id
Turn this
{
_id : ObjectId("47cc67093475061e3d95369d"),
u: BinData(0, "…") // <- this is uniquely indexed
}
into
{
_id : BinData(0, "…") // was the u field
}
Eliminates an extra index, but be careful about locality... (more later, see Further Reading at end)
11. Pack 'em in
• Look for cases where you can squish multiple "records" into a single document.
• Why?
– Decreases number of index entries
– Brings documents closer to the size of a page, alleviating potential fragmentation
• Example: comments for a blog post.
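A minimal sketch of the blog-post example (field names are illustrative, not from the original deck): instead of one document per comment, all comments for a post live in one document, so the lookup index carries one entry per post rather than one per comment.

```python
# One document per "record": N comments -> N index entries on post_id,
# and the documents can land on N different pages on disk.
per_comment = [
    {"post_id": 42, "author": "alice", "text": "Nice post"},
    {"post_id": 42, "author": "bob",   "text": "Agreed"},
]

# Packed: one document per post. One index entry, and a find() for
# the post's comments touches one page instead of several.
packed = {
    "_id": 42,  # the post id
    "comments": [
        {"author": "alice", "text": "Nice post"},
        {"author": "bob",   "text": "Agreed"},
    ],
}
```

The trade-off is document growth: a post that accumulates comments will outgrow its allocation, so this pairs naturally with the padding tricks discussed later.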
12. Prefix Indexes
Suppose you have an index on a large field, and a short prefix of that field is already selective enough to narrow a lookup. You can index just the prefix (a "prefix index") to greatly decrease index size.
find({k: <kval>})
{
k: BinData(0, "…"), // 32 byte SHA256, indexed
}
becomes find({p: <prefix>, k: <kval>})
{
k: BinData(0, "…"), // 28 byte SHA256 suffix, not indexed
p: <32-bit integer> // first 4 bytes of k packed in an integer, indexed
}
Example: git commits
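The split is mechanical — a sketch of how the slide's 32-byte SHA-256 would be carved into the indexed 4-byte integer prefix and the stored 28-byte suffix:

```python
import hashlib
import struct

def split_key(digest: bytes):
    """Split a 32-byte SHA-256 digest into a 4-byte integer prefix
    (indexed, fits a 32-bit int field) and the remaining 28-byte
    suffix (stored but not indexed)."""
    prefix = struct.unpack(">i", digest[:4])[0]
    suffix = digest[4:]
    return prefix, suffix

digest = hashlib.sha256(b"some git commit").digest()
p, k = split_key(digest)

# The query becomes find({p: p, k: suffix}): the index on p narrows
# the candidates, and the unindexed k field resolves any collisions.
```

The split is lossless — prefix and suffix together reconstruct the original digest — so correctness is unaffected; only the index shrinks.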
14. Fragmentation
• Data on disk is memory mapped into RAM.
• Mapped in pages (4KB usually).
• Deletes/updates will cause memory fragmentation.
[Diagram: a disk page memory-mapped into RAM — find(doc1) pulls in the whole page, including the deleted records sitting next to doc1.]
15. New writes mingle with old data
[Diagram: a new write of docX lands in a hole on a page otherwise filled with old documents doc1, doc3, doc4, doc5.]
find(docX) also pulls in old doc1, wasting RAM
16. Dealing with fragmentation
• "mongod --repair" on a secondary, swap with primary.
• 1.9 has in-place compaction, but this still holds a write-lock.
• MongoDB will auto-pad records.
• Pad records yourself by including and then removing extra bytes on first insert.
– Alternative offered in SERVER-1810.
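A sketch of the manual padding trick (the helper and field name are illustrative, not from the original deck): insert the document with a throwaway field sized to the allocation you want, then immediately $unset it — the record keeps its allocated size, so later growth happens in place instead of moving the record and leaving a hole.

```python
def pad_doc(doc: dict, target_bytes: int, encoded_bytes: int) -> dict:
    """Return doc plus a throwaway 'pad' field sized so the encoded
    document reaches roughly target_bytes. Insert the padded doc,
    then $unset 'pad': the record keeps its allocation as headroom."""
    pad_len = max(0, target_bytes - encoded_bytes)
    return {**doc, "pad": "x" * pad_len}

# encoded_bytes would come from encoding the document (e.g. BSON size).
padded = pad_doc({"u": "abc"}, target_bytes=256, encoded_bytes=40)
```

With a real driver the sequence is: insert(padded) followed by update({_id: …}, {"$unset": {"pad": 1}}).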
17. The Dark Side of Migrations
• Chunks are a logical construct, not physical.
• Shard keys have serious implications.
• What could go wrong?
– Let's run through an example.
18. Suppose the following
• k is the shard key
• k is random
[Diagram: Shard 1 holds Chunk 1 (k: 1 to 5) and Chunk 2 (k: 6 to 9). Writes arrive in random key order — {k: 3}, {k: 9}, {k: 1}, {k: 7}, {k: 2}, {k: 8} — so documents from both chunks interleave on disk.]
21. Why is this scenario bad?
• Random reads
• Massive fragmentation
• New writes mingle with old data
22. How can we avoid bad migrations?
• Pre-split, pre-chunk
• Better shard keys for better locality
– Ideally where data in the same chunk tends to be in the same region of disk
23. Pre-split and move
• If you know your key distribution, then pre-create your chunks and assign them.
• See this:
– http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
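A sketch of generating the pre-split plan, assuming a uniformly distributed integer key (the namespace and shard names are made up). The generated documents are the split and moveChunk admin commands you would run against mongos before loading data:

```python
def presplit_commands(ns, key, lo, hi, n_chunks, shards):
    """Generate admin commands to pre-create n_chunks chunks over
    [lo, hi) and spread them round-robin across shards, so the
    balancer never has to migrate live data during the load."""
    cmds = []
    step = (hi - lo) // n_chunks
    for i in range(1, n_chunks):
        cmds.append({"split": ns, "middle": {key: lo + i * step}})
    for i in range(n_chunks):
        cmds.append({"moveChunk": ns, "find": {key: lo + i * step},
                     "to": shards[i % len(shards)]})
    return cmds

cmds = presplit_commands("app.events", "k", 0, 1000, 4, ["shard0", "shard1"])
# 3 split points (250, 500, 750) + 4 moveChunk commands
```

Each command document would be passed to the admin database's runCommand on mongos.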
24. Better shard keys
• Usually means including a time prefix in your shard key (e.g., {day: 100, id: X})
• Beware of write hotspots
• How to Choose a Shard Key
– http://www.snailinaturtleneck.com/blog/2011/01/04/how-to-choose-a-shard-key-the-card-game/
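The {day, id} idea from the slide can be sketched as a small helper (field names follow the slide's example; the day granularity is an assumption):

```python
import time

def shard_key(doc_id: str, now: float) -> dict:
    """Compound shard key with a coarse time prefix: documents written
    on the same day cluster into the same chunks (good disk locality),
    while the id component spreads writes across chunks within the day,
    softening the hotspot a pure-time key would create."""
    day = int(now // 86400)  # days since the Unix epoch
    return {"day": day, "id": doc_id}

k = shard_key("abc123", now=time.time())
```

The trade-off the slide warns about: the coarser the time prefix, the better the locality but the hotter the write path — all inserts for a given day target the same key range.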
26. Working Set in RAM
• EC2 m2.2xlarge, RAID0 setup with 16 EBS volumes.
• Workers hammering MongoDB with this loop, growing data:
– Loop { insert 500 byte record; find random record }
• Thousands of ops per second when in RAM
• Much less throughput when working set (in this case, all data and index) grows beyond RAM.
[Chart: ops per second over time — high while the working set fits in RAM, dropping sharply once it no longer does.]
27. Pre-fetch
• Updates hold a lock while they fetch the original from disk.
• Instead do a read to warm the doc in RAM under a shared read lock, then update.
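The pattern is just "read, then write". A sketch with a hypothetical in-memory stand-in for a collection (FakeCollection is not a real driver class — with a real driver the same two calls go to the server, and the plain read faults the page in under the shared lock):

```python
class FakeCollection:
    """Stand-in to show the call pattern, not a real driver object."""
    def __init__(self):
        self.docs = {1: {"_id": 1, "hits": 0}}

    def find_one(self, query):
        # In real life: may fault the page in from disk, read lock only.
        return dict(self.docs[query["_id"]])

    def update(self, query, change):
        # In real life: exclusive write lock — keep this fast.
        self.docs[query["_id"]]["hits"] += change["$inc"]["hits"]

coll = FakeCollection()
coll.find_one({"_id": 1})                       # warm the doc under a read lock
coll.update({"_id": 1}, {"$inc": {"hits": 1}})  # write lock held only briefly
```

The read does the slow disk fetch under the cheaper shared lock, so the exclusive lock is held only for the in-memory mutation.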
28. Shard per core
• Instead of a shard per server, try a shard per core.
• Use this strategy to overcome write locks when writes per second matter.
• Why? Because MongoDB has one big write lock.
29. Amazon EC2
• High throughput / small working set
– RAM matters, go with high memory instances.
• Low throughput / large working set
– Ephemeral storage might be OK.
– Remember that EBS IO goes over Ethernet.
– Pay attention to IO wait time (iostat).
– Your only shot at consistent perf: use the biggest instances in a family.
• Read this:
– http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html
30. Amazon EBS
• ~200 seeks per second per EBS on a good day
• EBS has *much* better random IO perf than ephemeral, but adds a dependency
• Use RAID0
• Check out this benchmark:
– http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs/
• To understand how to monitor EBS:
– https://forums.aws.amazon.com/thread.jspa?messageID=124044
31. Further Reading
• MongoDB Perf Tuning at MongoSF 2011
– http://www.scribd.com/doc/56271132/MongoDB-Performance-Tuning
• Monitoring Tips
– http://blog.boxedice.com/mongodb-monitoring/
• Markus' manual
– http://www.markus-gattol.name/ws/mongodb.html
• Helpful/interesting blog posts
– http://nosql.mypopescu.com/tagged/mongodb/
• MongoDB on EC2
– http://www.slideshare.net/jrosoff/mongodb-on-ec2-and-ebs
• EC2 and Ephemeral Storage
– http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html
• MongoDB Strategies for the Disk Averse
– http://engineering.foursquare.com/2011/02/09/mongodb-strategies-for-the-disk-averse/
32. Thank you.
• Check out Localytics for mobile analytics!
• Reach me at:
– Email: my first name @ localytics.com
– twitter.com/andrew311
– andrewrollins.com