Mark Dunphy discusses the evolution of the Activity Feed system at Behance from MongoDB to Cassandra. The initial MongoDB implementation worked well for the smaller user base in 2011 but struggled with node failures, disk fragmentation, and downtime as the system scaled. Cassandra was chosen to replace it due to its scalability, performance, and ease of maintenance. Dunphy details the process of learning Cassandra, modeling the data, and developing a strategy for writes and reads that allowed the system to handle thousands of writes and reads per second while serving the Activity Feed to users.
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentHostedbyConfluent
Joins in Kafka Streams and ksqlDB are a killer-feature for data processing and basic join semantics are well understood. However, in a streaming world records are associated with timestamps that impact the semantics of joins: welcome to the fabulous world of _temporal_ join semantics. For joins, timestamps are as important as the actual data and it is important to understand how they impact the join result.
In this talk we want to deep dive on the different types of joins, with a focus of their temporal aspect. Furthermore, we relate the individual join operators to the overall ""time engine"" of the Kafka Streams query runtime and explain its relationship to operator semantics. To allow developers to apply their knowledge on temporal join semantics, we provide best practices, tip and tricks to ""bend"" time, and configuration advice to get the desired join results. Last, we give an overview of recent, and an outlook to future, development that improves joins even further.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
ksqlDB: A Stream-Relational Database Systemconfluent
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on Github and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high vailability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentHostedbyConfluent
Joins in Kafka Streams and ksqlDB are a killer-feature for data processing and basic join semantics are well understood. However, in a streaming world records are associated with timestamps that impact the semantics of joins: welcome to the fabulous world of _temporal_ join semantics. For joins, timestamps are as important as the actual data and it is important to understand how they impact the join result.
In this talk we want to deep dive on the different types of joins, with a focus of their temporal aspect. Furthermore, we relate the individual join operators to the overall ""time engine"" of the Kafka Streams query runtime and explain its relationship to operator semantics. To allow developers to apply their knowledge on temporal join semantics, we provide best practices, tip and tricks to ""bend"" time, and configuration advice to get the desired join results. Last, we give an overview of recent, and an outlook to future, development that improves joins even further.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
ClickHouse clusters depend on ZooKeeper to handle replication and distributed DDL commands. In this Altinity webinar, we’ll explain why ZooKeeper is necessary, how it works, and introduce the new built-in replacement named ClickHouse Keeper. You’ll learn practical tips to care for ZooKeeper in sickness and health. You’ll also learn how/when to use ClickHouse Keeper. We will share our recommendations for keeping that happy as well.
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
Join the Altinity experts as we dig into ClickHouse sharding and replication, showing how they enable clusters that deliver fast queries over petabytes of data. We’ll start with basic definitions of each, then move to practical issues. This includes the setup of shards and replicas, defining schema, choosing sharding keys, loading data, and writing distributed queries. We’ll finish up with tips on performance optimization.
#ClickHouse #datasets #ClickHouseTutorial #opensource #ClickHouseCommunity #Altinity
-----------------
Join ClickHouse Meetups: https://www.meetup.com/San-Francisco-...
Check out more ClickHouse resources: https://altinity.com/resources/
Visit the Altinity Documentation site: https://docs.altinity.com/
Contribute to ClickHouse Knowledge Base: https://kb.altinity.com/
Join the ClickHouse Reddit community: https://www.reddit.com/r/Clickhouse/
----------------
Learn more about Altinity!
Site: https://www.altinity.com
LinkedIn: https://www.linkedin.com/company/alti...
Twitter: https://twitter.com/AltinityDB
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
ClickHouse Monitoring 101: What to monitor and howAltinity Ltd
Webinar. Presented by Robert Hodges and Ned McClain, April 1, 2020
You are about to deploy ClickHouse into production. Congratulations! But what about monitoring? In this webinar we will introduce how to track the health of individual ClickHouse nodes as well as clusters. We'll describe available monitoring data, how to collect and store measurements, and graphical display using Grafana. We'll demo techniques and share sample Grafana dashboards that you can use for your own clusters.
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
Flight SQL is a revolutionary new open database protocol designed for modern architectures. Key features in Flight SQL include a columnar-oriented design and native support for parallel processing of data partitions. This talk will go over how these new features can push SQL query throughput beyond existing standards such as ODBC.
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTL's.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect, and day one employee of ProtectWise, Inc., specializing in massive real time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
Columnar stores like ClickHouse enable users to pull insights from big data in seconds, but only if you set things up correctly. This talk will walk through how to implement a data warehouse that contains 1.3 billion rows using the famous NY Yellow Cab ride data. We'll start with basic data implementation including clustering and table definitions, then show how to load efficiently. Next, we'll discuss important features like dictionaries and materialized views, and how they improve query efficiency. We'll end by demonstrating typical queries to illustrate the kind of inferences you can draw rapidly from a well-designed data warehouse. It should be enough to get you started--the next billion rows is up to you!
Apache Doris (incubating) is an MPP-based interactive SQL data warehousing for reporting and analysis. It is open-sourced by Baidu. Doris mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris not only provides batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris.
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
Why do queries run out of memory? How can I make my queries even faster? How should I size ClickHouse nodes for best cost-efficiency? The key to these questions and many others is knowing what happens inside ClickHouse when a query runs. This webinar is a gentle introduction to ClickHouse internals, focusing on topics that will help your applications run faster and more efficiently. We’ll discuss the basic flow of query execution, dig into how ClickHouse handles aggregation and joins, and show you how ClickHouse distributes processing within a single CPU as well as across many nodes in the network. After attending this webinar you’ll understand how to open up the black box and see what the parts are doing.
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
Kubernetes has the concept of resource requests and limits. Pods get scheduled on the nodes based on their requests and optionally limited in how much of the resource they can consume. Understanding and optimizing resource requests/limits is crucial both for reducing resource "slack" and ensuring application performance/low-latency. This talk shows our approach to monitoring and optimizing Kubernetes resources for 80+ clusters to achieve cost-efficiency and reducing impact for latency-critical applications. All shown tools are Open Source and can be applied to most Kubernetes deployments.
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
ksqlDB is a stream processing SQL engine, which allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Stream and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL like language and produce results again to a Kafka topic. By that, no single line of Java code has to be written and you can reuse your SQL knowhow. This lowers the bar for starting with stream processing significantly.
ksqlDB offers powerful capabilities of stream processing, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for most part. This will be done in a live demo on a fictitious IoT sample.
A Deep Dive into Apache Cassandra for .NET DevelopersLuke Tillman
.NET developers have a lot of options when it comes to databases these days. Apache Cassandra is a scalable, fault-tolerant database that has already found its way into more than 25% of the Fortune 100 and continues to grow in popularity. But what makes it different from the myriad of other options available? In this talk, we’ll take a deep dive into Cassandra and learn about:
- Cassandra’s internals and how it works
- CQL (the SQL-like query language for Cassandra)
- Data Modeling like a pro
- Tools available for developers
- Writing .NET code that talks to Cassandra
If there’s time and interest, we’ll finish up with how some companies are already using Cassandra to power services you probably interact with in your daily life. You’ll leave with all the tools you need to start build highly available .NET applications and services on top of Cassandra.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Mobile 2: What's My Place in the Universe? Using Geo-Indexing to Solve Existe...MongoDB
Second in the mobile series. Our mobile application takes shape as we build out a schema to support geo-specific data and friend finding. Learn how to use MongoDB geo-indexing for real-time proximity matching and gather analytics with the aggregation framework.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
ClickHouse clusters depend on ZooKeeper to handle replication and distributed DDL commands. In this Altinity webinar, we’ll explain why ZooKeeper is necessary, how it works, and introduce the new built-in replacement named ClickHouse Keeper. You’ll learn practical tips to care for ZooKeeper in sickness and health. You’ll also learn how/when to use ClickHouse Keeper. We will share our recommendations for keeping that happy as well.
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
Join the Altinity experts as we dig into ClickHouse sharding and replication, showing how they enable clusters that deliver fast queries over petabytes of data. We’ll start with basic definitions of each, then move to practical issues. This includes the setup of shards and replicas, defining schema, choosing sharding keys, loading data, and writing distributed queries. We’ll finish up with tips on performance optimization.
#ClickHouse #datasets #ClickHouseTutorial #opensource #ClickHouseCommunity #Altinity
-----------------
Join ClickHouse Meetups: https://www.meetup.com/San-Francisco-...
Check out more ClickHouse resources: https://altinity.com/resources/
Visit the Altinity Documentation site: https://docs.altinity.com/
Contribute to ClickHouse Knowledge Base: https://kb.altinity.com/
Join the ClickHouse Reddit community: https://www.reddit.com/r/Clickhouse/
----------------
Learn more about Altinity!
Site: https://www.altinity.com
LinkedIn: https://www.linkedin.com/company/alti...
Twitter: https://twitter.com/AltinityDB
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
ClickHouse Monitoring 101: What to monitor and howAltinity Ltd
Webinar. Presented by Robert Hodges and Ned McClain, April 1, 2020
You are about to deploy ClickHouse into production. Congratulations! But what about monitoring? In this webinar we will introduce how to track the health of individual ClickHouse nodes as well as clusters. We'll describe available monitoring data, how to collect and store measurements, and graphical display using Grafana. We'll demo techniques and share sample Grafana dashboards that you can use for your own clusters.
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
Flight SQL is a revolutionary new open database protocol designed for modern architectures. Key features in Flight SQL include a columnar-oriented design and native support for parallel processing of data partitions. This talk will go over how these new features can push SQL query throughput beyond existing standards such as ODBC.
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTL's.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect, and day one employee of ProtectWise, Inc., specializing in massive real time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
Columnar stores like ClickHouse enable users to pull insights from big data in seconds, but only if you set things up correctly. This talk will walk through how to implement a data warehouse that contains 1.3 billion rows using the famous NY Yellow Cab ride data. We'll start with basic data implementation including clustering and table definitions, then show how to load efficiently. Next, we'll discuss important features like dictionaries and materialized views, and how they improve query efficiency. We'll end by demonstrating typical queries to illustrate the kind of inferences you can draw rapidly from a well-designed data warehouse. It should be enough to get you started--the next billion rows is up to you!
Apache Doris (incubating) is an MPP-based interactive SQL data warehousing for reporting and analysis. It is open-sourced by Baidu. Doris mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris not only provides batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris.
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
Why do queries run out of memory? How can I make my queries even faster? How should I size ClickHouse nodes for best cost-efficiency? The key to these questions and many others is knowing what happens inside ClickHouse when a query runs. This webinar is a gentle introduction to ClickHouse internals, focusing on topics that will help your applications run faster and more efficiently. We’ll discuss the basic flow of query execution, dig into how ClickHouse handles aggregation and joins, and show you how ClickHouse distributes processing within a single CPU as well as across many nodes in the network. After attending this webinar you’ll understand how to open up the black box and see what the parts are doing.
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
Kubernetes has the concept of resource requests and limits. Pods get scheduled on the nodes based on their requests and optionally limited in how much of the resource they can consume. Understanding and optimizing resource requests/limits is crucial both for reducing resource "slack" and ensuring application performance/low-latency. This talk shows our approach to monitoring and optimizing Kubernetes resources for 80+ clusters to achieve cost-efficiency and reducing impact for latency-critical applications. All shown tools are Open Source and can be applied to most Kubernetes deployments.
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
ksqlDB is a stream processing SQL engine, which allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Stream and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL like language and produce results again to a Kafka topic. By that, no single line of Java code has to be written and you can reuse your SQL knowhow. This lowers the bar for starting with stream processing significantly.
ksqlDB offers powerful capabilities of stream processing, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for most part. This will be done in a live demo on a fictitious IoT sample.
A Deep Dive into Apache Cassandra for .NET DevelopersLuke Tillman
.NET developers have a lot of options when it comes to databases these days. Apache Cassandra is a scalable, fault-tolerant database that has already found its way into more than 25% of the Fortune 100 and continues to grow in popularity. But what makes it different from the myriad of other options available? In this talk, we’ll take a deep dive into Cassandra and learn about:
- Cassandra’s internals and how it works
- CQL (the SQL-like query language for Cassandra)
- Data Modeling like a pro
- Tools available for developers
- Writing .NET code that talks to Cassandra
If there’s time and interest, we’ll finish up with how some companies are already using Cassandra to power services you probably interact with in your daily life. You’ll leave with all the tools you need to start build highly available .NET applications and services on top of Cassandra.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Mobile 2: What's My Place in the Universe? Using Geo-Indexing to Solve Existe...MongoDB
Second in the mobile series. Our mobile application takes shape as we build out a schema to support geo-specific data and friend finding. Learn how to use MongoDB geo-indexing for real-time proximity matching and gather analytics with the aggregation framework.
Building a complete social networking platform presents many challenges at scale. Socialite is a reference architecture and open source Java implementation of a scalable social feed service built on DropWizard and MongoDB. We'll provide an architectural overview of the platform, explaining how you can store an infinite timeline of data while optimizing indexing and sharding configuration for access to the most recent window of data. We'll also dive into the details of storing a social user graph in MongoDB.
Socialite, the Open Source Status Feed Part 3: Scaling the Data FeedMongoDB
Scaling the delivery of posts and content to the follower networks of millions of users has many challenges. In this section we look at the various approaches to fanning out posts and look at a performance comparison between them. We will highlight some tricks for caching the recent timeline of active users to drive down read latency. We will also look at overall performance metrics from Socialite as we scale from a single replica set to a large sharded environment using MMS Automation.
Developers love MongoDB because its flexible document model enhances their productivity. But did you know that MongoDB supports rich queries and lets you accomplish some of the same things you currently do with SQL statements? And that MongoDB's powerful aggregation framework makes it possible to perform real-time analytics for dashboards and reports?
Attend this webinar for an introduction to the MongoDB aggregation framework and a walk through of what you can do with it. We'll also demo using it to analyze U.S. census data.
MongoGraph - MongoDB Meets the Semantic WebDATAVERSITY
AllegroGraph is a fully ACID and highly scalable RDF triplestore that can be programmed with compiled, server side JavaScript. This allows programmers to easily manipulate individual triples and create their own intelligent graph or reasoning algorithms. However, one wish that has been expressed by many programmers is to work on the level of objects instead of individual triples, where an object would be defined as all the triples with the same subject. So we created a MongoDB interface where programmers can add, delete and modify JSON objects directly into MongoDB like substores. This gives us the best of both worlds. We get the beautiful simplicity of the MongoDB interface for working with objects and we get the all the properties of an advanced triplestore, in this case joins through SPARQL queries, automatic indexing of all attributes/values, ACID properties all packaged to deliver a simple entry into the world of the Semantic Web.
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB
The popularity of dedicated graph technologies has risen greatly in recent years, at least partly fuelled by the explosion in social media and similar systems, where a friend network or recommendation engine is often a critical component when delivering a successful application. MongoDB 3.4 introduces a new Aggregation Framework graph operator, $graphLookup, to enable some of these types of use cases to be built easily on top of MongoDB. We will see how semantic relationships can be modelled inside MongoDB today, how the new $graphLookup operator can help simplify this in 3.4, and how $graphLookup can be used to leverage these relationships and build a commercially focused news article recommendation system.
MongoDB Days Silicon Valley: Implementing Graph Databases with MongoDBMongoDB
Presented by: John Barry, Senior Platform Developer, Apervita
Experience level: Deep dive
Graph databases are natural representation for many forms of data - from social network relationships, to music genres, to medical vocabulary data. Apervita’s challenge is importing and housing vast medical ontologies, and allowing efficient traversal of the graph retrieving both parent and child nodes across a variety of relationships.
In this session, we will discuss the characteristics of graph databases, an implementation strategy for MongoDB, and an approach for implementing transitive closure for efficient traversal of the graph with limited database trips.
Back to Basics Webinar 4: Advanced Indexing, Text and Geospatial IndexesMongoDB
This is the fourth webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will introduce you to the aggregation framework.
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
This is the first webinar of a Back to Basics series that will introduce you to the MongoDB database, what it is, why you would use it, and what you would use it for.
Webinar: Back to Basics: Thinking in DocumentsMongoDB
New applications, users and inputs demand new types of data, like unstructured, semi-structured and polymorphic data. Adopting MongoDB means adopting to a new, document-based data model.
While most developers have internalized the rules of thumb for designing schemas for relational databases, these rules don't apply to MongoDB. Documents can represent rich data structures, providing lots of viable alternatives to the standard, normalized, relational model. In addition, MongoDB has several unique features, such as atomic updates and indexed array keys, that greatly influence the kinds of schemas that make sense.
In this session, Buzz Moschetti explores how you can take advantage of MongoDB's document model to build modern applications.
View all the MongoDB World 2016 Poster Sessions slides in one place!
Table of Contents:
1: BigData DB Infrastructure for Modeling the Fly Brain
2: Taming the WiredTiger Cache
3: Sharding with MongoDB 3.2 Kick the tires and pop the hood!
4: Scaling Proactive Anomaly Detection
5: MongoTx: Transactions with Sharding and Queries
6: MongoDB: It’s Not Too Late To Shard
7: DLIFLC usage of MongoDB
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkMongoDB
This is the fifth webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will introduce you to the aggregation framework.
Slides from my last presentation at the Cape Town Meteor meetup, on optimising the UI, specifically for Hybrid apps and for Meteor JS hybrid apps.
The main thrust is really more about design patterns, and carefully controlling data management in your mobile app, with great examples of these patterns out in the real world.
see the mobile patterns video here : https://www.youtube.com/watch?v=e6WWX4TF3UI
In April 2014, Pinterest engineers presented to members of the engineering community at a series of Tech Talks held at the Pinterest offices in San Francisco. Topics included:
- Mobile & Growth: Scaling user education on mobile, and a deep dive into the new user experience (with engineers Dannie Chu and Wendy Lu)
- Monetization & Data: The open sourcing of Pinterest Secor and a look at zero data loss log persistence services (with engineer Pawel Garbacki)
- Developing & Shipping Code at Pinterest: The tools and technologies Pinterest uses to build quickly and deploy confidently.
You can find more at: engineering.pinterest.com and facebook.com/pinterestengineering
Bridging Current Reality & Future Vision with Reality MapsMalini Rao
Using a versatile design research technique, this presentation calls designers to give themselves permission to be flexible in their design practice by being the master of their techniques and get creative with the design process as much as they get creative with the experiences they design!
Data Driven Design - Frontend Conference ZurichMemi Beltrame
Data driven prototyping goes far beyond the mere administration of content for prototyping purposes. It is a powerful tool to handle the needs arising from interfaces with extensive amounts of microcontent - tiny but important pieces of content, usually involved in microinteractions like transactions, changes or updates.
Ever so often users are sensitive to minute changes of content - for example stock prices changing quickly or dates and times in news reflecting current time. In these cases it is important to be able to rely on dynamic data that simulates the behavior of the real content as close as possible.
This talk is about why it is important to build rich functional prototypes that focus on content and how this can be achieved. It gives an overview of the benefits and obstacles of data driven prototyping and contains a wide range of examples of how data driven prototyping can make the difference between a good and a great prototype.
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
Amundsen is the data discovery metadata platform that originated from Lyft which is recently donated to Linux Foundation AI. Since its open-sourced, Amundsen has been used and extended by many different companies within our community.
This is part 1 of a series of webinars on Microservices architecture for the modern enterprise. We will start broad and gather a high-level understanding and feel for the architecture using use cases and stories from some of our customers. These techniques can of course be applied whether or not you use the Iron platform to deploy and orchestrate your microservices.
Building Reactive Real-time Data PipelineTrieu Nguyen
Topic: Building reactive real-time data pipeline at FPT ?
1) What is “Data Pipeline” ?
2) Big Data Problems at FPT
+ VnExpress: pageview and heat-map
+ eClick: real-time reactive advertising
3) Solutions and Patterns
4) Fast Data Architecture at FPT
5) Wrap up
A run-down of the Drupal 8 initiatives for Drupal 8.2 and beyond: Migrate, Content Workflow, API-first, Media, Blocks and Layouts, Data Modelling, Theme Component Library, Cross-Channel Orchestration
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
13. • Frequent node failures
• Heavy disk fragmentation caused by deletes
• Slow reads from disk. Started storing in RAM.
• Primary -> Secondary caused downtime for
some.
• Scaled out vertically and horizontally.
17. • Fantastic community. #cassandra on Twitter
• Easy to read documentation
• Linearly scalable. Easy to grow cluster.
• Low maintenance overhead for ops team.
• Handles time series data very well.
23. • User’s feed is comprised of entities with one set
of actions
• User’s feed only contains one of any given entity
• An entity’s set of actions contains up to seven of
the most recent actions taken by that user’s
network
28. • App/cluster in production before anything works
• Test real life load
• Fail spectacularly without anybody noticing
• Deploy risky changes without fear
• Run alongside MongoDB
30. Query Patterns
• “Create your data models based on the queries
you want to run” - Basically Everybody
• Wanted to…
• Read a user’s feed entities by type and time of
most recent action…separately.
• Write/Update a user’s feed entities with new
actions while knowing only user id and entity id
32. –Mark Dunphy, January 2015
“An UPDATE in Cassandra works like an
UPSERT! Let’s store the user’s entire feed in a
single row in a table! It’s so simple!”
First Data Model
33. CREATE TYPE activity.action (
created_on timestamp,
secondary_entity_id int,
actor_id int,
verb_id int
);
CREATE TYPE activity.entity (
entity_type_id int,
entity_id int
);
38. –Mark Dunphy, January 2015
“Okay let’s keep nearly the same model, but
use INSERT and DELETE instead of always
UPDATE. Just use batch statements.”
Second Data Model
40. • Lose the benefit of Cassandra being distributed
• All queries go through the same coordinator
which puts a lot of stress and responsibility on
one node.
• Use concurrency and prepared statements
instead. Datastax drivers make this easy.
Second Data Model
49. Write Strategy
• “User A comments on Project A. User B follows
User A.”
• Request out to add the comment action to User
B’s feed
• Read existing actions for that entity (Project A) in
B’s feed. Push new action on top.
• Write new actions list into new “row” in projects
table
50. Read Strategy
• SELECT * FROM projects WHERE user_id
= 123 AND created_on > 123214373
• Optimized for quick/easy reads. More important
that a user’s feed loads quickly than it updating
quickly.
• Use timestamp to “page” through data.
51. Lessons Learned
• Duplicate your data to achieve desired queries.
Storage is cheap. Writes are cheap.
• Think outside the box. Cassandra is not
relational.
• Never ever ever ignore inserts/deletes in favor of
an update only workflow. Never. It is literally
insane.
52. Final Specs
• 16 node cluster on AWS EC2 c3.8xlarge
• Mix of SizeTieredCompactionStrategy and
DateTieredCompactionStrategy
• NetworkTopologyStrategy
• Replication factor 3
• ConsistencyLevel = ONE for most requests
53. Final Specs
• Bursty write volume. Consistent read volume.
• 5k to 80k writes per second
• 2k to 4k reads per second