ZooKeeper supports dynamic reconfiguration of the servers in its ensemble. Manual reconfiguration is problematic: it requires changing the configuration and restarting servers, and it can result in data loss. The presented solution lets ZooKeeper reconfigure itself automatically using a speculative reconfiguration approach. It commits the reconfiguration once quorums of both the old and the new ensembles acknowledge it, and gossips the new configuration to ensure all servers sync before activation. This allows reconfigurations to complete without failures and in a manner transparent to clients.
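The commit rule described above (acknowledgement by quorums of both the old and the new ensembles) can be illustrated with a minimal Python sketch. It assumes a quorum means a strict majority and uses hypothetical server IDs and function names; it is not ZooKeeper's implementation.

```python
from __future__ import annotations


def has_quorum(acks: set[int], ensemble: set[int]) -> bool:
    """True when a strict majority of `ensemble` appears in `acks` (assumed quorum rule)."""
    return len(acks & ensemble) > len(ensemble) // 2


def can_commit_reconfig(acks: set[int], old: set[int], new: set[int]) -> bool:
    """Commit only when quorums of BOTH the old and the new ensembles have acknowledged."""
    return has_quorum(acks, old) and has_quorum(acks, new)


# Hypothetical ensembles: server 3 stays, servers 1 and 2 are replaced by 4 and 5.
old_ensemble = {1, 2, 3}
new_ensemble = {3, 4, 5}

print(can_commit_reconfig({1, 2}, old_ensemble, new_ensemble))     # False: no quorum of the new ensemble
print(can_commit_reconfig({1, 3, 4}, old_ensemble, new_ensemble))  # True: majority of each ensemble
```

Requiring both quorums means neither the outgoing nor the incoming set of servers can proceed without knowing about the change.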
Deep Dive on Amazon Aurora PostgreSQL Performance Tuning (DAT428-R1) - AWS re... - Amazon Web Services
Amazon Aurora offers several options for monitoring and optimizing PostgreSQL database performance. These include Enhanced Monitoring and Performance Insights, an easy-to-use tool for assessing the load on your database and identifying slow-performing queries. In this session, learn how to tune the performance of your Aurora database with PostgreSQL compatibility, whether your application is in development or in production. Please join us for a speaker meet-and-greet following this session at the Speaker Lounge (ARIA East, Level 1, Willow Lounge). The meet-and-greet starts 15 minutes after the session and runs for half an hour.
All about Zookeeper and ClickHouse Keeper.pdf - Altinity Ltd
ClickHouse clusters depend on ZooKeeper to handle replication and distributed DDL commands. In this Altinity webinar, we’ll explain why ZooKeeper is necessary, how it works, and introduce the new built-in replacement named ClickHouse Keeper. You’ll learn practical tips to care for ZooKeeper in sickness and health. You’ll also learn how/when to use ClickHouse Keeper. We will share our recommendations for keeping that happy as well.
From cache to in-memory data grid. Introduction to Hazelcast. - Taras Matyashovsky
This presentation:
* covers basics of caching and popular cache types
* explains evolution from simple cache to distributed, and from distributed to IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or as a promotion of Hazelcast as the best solution
Building Reliable Lakehouses with Apache Flink and Delta Lake - Flink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
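The idempotent-write idea in the description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InMemoryTable`, `app_id`, and `checkpoint_id` are hypothetical names, and the sketch does not reproduce the Flink/Delta Connector's actual code. The table remembers the highest checkpoint committed per writer, so a replay after a restart is ignored.

```python
class InMemoryTable:
    """Toy stand-in for a table that accepts idempotent commits (hypothetical)."""

    def __init__(self):
        self.rows = []
        self.last_committed = {}  # app_id -> highest checkpoint id already committed

    def commit(self, app_id, checkpoint_id, rows):
        """Append rows once per (app_id, checkpoint_id); return False for a replay."""
        if checkpoint_id <= self.last_committed.get(app_id, -1):
            return False  # this checkpoint was already committed: skip to avoid duplicates
        self.rows.extend(rows)
        self.last_committed[app_id] = checkpoint_id
        return True


table = InMemoryTable()
print(table.commit("job-a", 1, ["r1", "r2"]))  # True: first commit of checkpoint 1
print(table.commit("job-a", 1, ["r1", "r2"]))  # False: replay after a restart is ignored
print(table.rows)                              # ['r1', 'r2']
```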
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop. It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, how to upgrade to newer versions, and how to migrate to the new Kafka Producer and Consumer APIs. We will also talk about the best practices involved in running a producer/consumer.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now supports user authentication and access control over who can read from and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase the open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices - Markus Michalewicz
This presentation discusses operational best practices considering the increasing tendency to use automation to tackle repetitive tasks, which changes how best practices are applied. The presentation therefore introduces and explains which Oracle tools can and should be used to apply best practices. It also discusses "smart features" that one will benefit from automatically after upgrading to Oracle RAC 12c Rel. 2. This presentation was first presented during UKOUG Tech17.
In this session we explore the various ways you can set up a connection strategy. We'll start with Oracle's UCP (Universal Connection Pool): its architecture and, most notably, how to size it. We'll discuss important concepts such as connection reservation and the distinction between connection, process, and session.
Besides UCP, there are the Database Resident Connection Pool (DRCP) and the Proxy Resident Connection Pool (PRCP), both of which will be discussed. We'll also look into combining different types of pools: what are their typical use cases, and what are the pitfalls?
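Apache kafka performance(latency)_benchmark_v0.3 - SANG WON PARK
A performance test of how quickly (with low latency) image data can be delivered using Apache Kafka.
The ultimate goal is to use Kafka as a message queue that feeds large volumes of real-time video/image data to AI (ML/DL) models, so the test checks how quickly images sent from equipment such as drones or manufacturing processes can be delivered to the AI model.
To that end, we ran a simple test of sending images through Kafka,
and checked how much it reduces latency (compared with the HTTP protocol and sockets).
[Conclusions so far]
- Apache Kafka is a solution optimized for throughput when handling a large number of requests.
- The current results come from tuning only a few producer options,
- so they are provisional, but improving Kafka's latency will likely take many more attempts.
- In other words, the latency of a single request is clearly slow,
- but when compared on the basis of high-volume processing, the average latency drops considerably.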
Test Code : https://github.com/freepsw/kafka-latency-test
Understanding and Improving Code Generation - Databricks
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function.
Building an Event Streaming Architecture with Apache Pulsar - ScyllaDB
What is Apache Pulsar? How does it differ from other event streaming technologies available? StreamNative Developer Advocate Tim Spann will walk you through the features and architecture of this increasingly popular event streaming system, along with best practices for streaming and storing your data.
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
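The failover flow outlined above (promoting another replica to leader when a broker fails) can be sketched as follows. This is a simplified Python illustration, not Kafka's controller code: the dictionary layout and the function name `handle_broker_failure` are assumptions, and the sketch simply promotes the first surviving in-sync replica.

```python
def handle_broker_failure(partitions, failed_broker):
    """Return new partition state after `failed_broker` goes down.

    `partitions` maps a partition name to {"leader": broker_id, "isr": [broker_ids]}.
    The failed broker is dropped from every in-sync replica (ISR) list, and any
    partition it led gets the first remaining in-sync replica as its new leader.
    """
    updated = {}
    for name, state in partitions.items():
        isr = [b for b in state["isr"] if b != failed_broker]
        leader = state["leader"]
        if leader == failed_broker:
            leader = isr[0] if isr else None  # None: no in-sync replica left, partition offline
        updated[name] = {"leader": leader, "isr": isr}
    return updated


cluster = {
    "orders-0": {"leader": 1, "isr": [1, 2, 3]},
    "orders-1": {"leader": 2, "isr": [2, 3]},
    "orders-2": {"leader": 1, "isr": [1]},
}
for name, state in handle_broker_failure(cluster, failed_broker=1).items():
    print(name, state)
```

With broker 1 failed, `orders-0` moves to broker 2, `orders-1` is unaffected, and `orders-2` has no in-sync replica left to promote.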
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost-saving reasons. The initial implementation of Reactive Mode was introduced in Flink 1.13; later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and... - Altinity Ltd
Presented at the webinar, July 31, 2019
Built-in replication is a powerful ClickHouse feature that helps scale data warehouse performance as well as ensure high availability. This webinar will introduce how replication works internally, explain configuration of clusters with replicas, and show you how to set up and manage ZooKeeper, which is necessary for replication to function. We'll finish off by showing useful replication tricks, such as utilizing replication to migrate data between hosts. Join us to become an expert in this important subject!
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa... - ScaleGrid.io
Compare top PostgreSQL high availability frameworks - PostgreSQL Automatic Failover (PAF), Replication Manager (repmgr) and Patroni to improve your app uptime. ScaleGrid blog - https://scalegrid.io/blog/whats-the-best-postgresql-high-availability-framework-paf-vs-repmgr-vs-patroni-infographic/
Building a Streaming Microservice Architecture: with Apache Spark Structured ... - Databricks
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
Tame the small files problem and optimize data layout for streaming ingestion... - Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) the small-files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of files that every task writes concurrently. It can also improve data clustering. In this talk, we will explain the motivation in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
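To make the bin-packing idea in the description above concrete, here is a minimal Python sketch that assigns partition keys to writer tasks with a greedy heuristic (largest key first, always to the least-loaded task). The function name, the key weights, and the heuristic itself are assumptions made for illustration; the talk's actual shuffling design is not reproduced here.

```python
import heapq


def assign_keys_to_tasks(key_weights, num_tasks):
    """Greedy bin packing: map each partition key to exactly one writer task.

    Keys are placed heaviest first onto the currently least-loaded task, so each
    key's data lands in a single task instead of being spread across all of them.
    """
    heap = [(0, task) for task in range(num_tasks)]  # (current load, task id)
    heapq.heapify(heap)
    assignment = {}
    for key, weight in sorted(key_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, task = heapq.heappop(heap)
        assignment[key] = task
        heapq.heappush(heap, (load + weight, task))
    return assignment


# Hypothetical traffic share per partition key (for example, per event date).
weights = {"2022-01-01": 50, "2022-01-02": 30, "2022-01-03": 20, "2022-01-04": 10}
print(assign_keys_to_tasks(weights, num_tasks=2))
# {'2022-01-01': 0, '2022-01-02': 1, '2022-01-03': 1, '2022-01-04': 0}
```

Because every key maps to one task, each task keeps fewer files open at once than it would if records were distributed round-robin.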
Stephan Ewen - Experiences running Flink at Very Large Scale - Ververica
This talk shares experiences from deploying and tuning Flink stream processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into:
- analyzing and tuning checkpointing
- selecting and configuring state backends
- understanding common bottlenecks
- understanding and configuring network parameters
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI - Altinity Ltd
Graham Mainwaring and Robert Hodges summarize management of ClickHouse on Kubernetes using the ClickHouse Kubernetes Operator and introduce a new UI for it. Presented at the 15 Dec '22 SF Bay Area ClickHouse Meetup.
Espresso: LinkedIn's Distributed Data Serving Platform (Talk) - Amy W. Tang
This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here:
http://www.slideshare.net/amywtang/espresso-20952131
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16 - MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
NYAI - Scaling Machine Learning Applications by Braxton McKee - Rizwan Habib
Scaling Machine Learning Systems - (Braxton McKee, CEO & Founder, Ufora)
Braxton is the technical lead and founder of Ufora, a software company that develops Pyfora, an automatically parallel implementation of the Python programming language that enables data science and machine-learning at scale. Before founding Ufora with backing from Two Sigma Ventures and others, Braxton led the ten-person MBS/ABS Credit Modeling team at Ellington Management Group, a multi-billion dollar mortgage hedge fund. He holds a BS (Mathematics), MS (Mathematics), and M.B.A. from Yale University.
Braxton will discuss scaling machine learning applications using the open-source platform Pyfora. He will describe both the general approach and also some specific engineering techniques employed in the implementation of Pyfora that make it possible to produce large-scale machine learning and data science programs directly from single-threaded Python code.
Greg Hogan – To Petascale and Beyond - Apache Flink in the Clouds - Flink Forward
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Using GPUs to Handle Big Data with Java - Tim Ellison
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Genomics Is Not Special: Towards Data Intensive Biology - Uri Laserson
Genomics and the life sciences are using antiquated technology for processing data. As data volumes increase in the life sciences, many in the biology community are reinventing the wheel, without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.
This case study gives an inside look at optimization of the MongoDB Perl driver, including custom benchmarking tools, step-by-step changes and results that will surprise and amaze. If you ever needed to optimize some Perl and wondered how people go about it, this talk is for you.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016 - MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Eventually Consistent Data Structures (from strangeloop12) - Sean Cribbs
There are many reasons to use an eventually-consistent database — like Riak, Voldemort, or Cassandra — including increased availability, lower latency, and fault-tolerance. However, doing so requires a mental shift in how to structure client applications, and certain types of traditional data-structures, like sets, registers, and counters can’t be resolved simply in the face of race-conditions. It is difficult to achieve “logical monotonicity” except for the most trivial data-types.
That is, until the advent of Convergent Replicated Data Types (CRDTs). CRDTs are data-structures that tolerate eventual consistency. They replace traditional data-structure implementations and all have the property that, given any number of conflicting versions of the same datum, there is a single state on which they converge (monotonicity). This talk will discuss some of the most useful CRDTs and how to apply them to solve real-world data problems.
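As a small illustration of the convergence property described above, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and replica names are illustrative and not taken from the talk. Each replica increments only its own slot, and merging takes the per-replica maximum, so replicas reach the same value regardless of merge order or repetition.

```python
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> number of increments seen from that replica

    def increment(self, amount=1):
        """Record `amount` local increments in this replica's own slot."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Fold in another replica's state by taking the per-slot maximum."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        """The counter's value is the sum over all replica slots."""
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)   # two increments applied at replica a
b.increment(3)   # three concurrent increments applied at replica b
a.merge(b)
b.merge(a)
a.merge(b)       # merging again changes nothing (idempotent)
print(a.value(), b.value())  # 5 5
```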
Towards a General Approach for Symbolic Model-Checker PrototypingEdmundo López Bóbeda
We propose a novel approach to prototype and create symbolic model-checkers. Our approach focuses on providing a high level abstraction above Decision Diagrams. It allows the model-checker creator to start from a high level formal semantics and to define an efficient Decision Diagram based model-checker.
Similar to Dynamic Reconfiguration of Apache ZooKeeper (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with Python’s scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of Python is highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), a guarantee that HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Using Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data using Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on or integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product with different release versions currently in use, some of which can carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and resolution action, drawing on my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables such as trips that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and that annotates each incoming change with the location in HDFS where this data should be written. This component is called Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. It needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component; it is critical in allowing us to scale our jobs to more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and provides various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another datacenter. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive deep into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also dive deep into Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate. A deep learning system attached to a camera stream can identify various storefront situations, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
2. Why ZooKeeper?
• Lots of servers
• Lots of processes
• High volumes of data
• Highly complex software systems
• … mere mortal developers
3. What ZooKeeper gives you
● Simple programming model
● Coordination of distributed processes
● Fast notification of changes
● Elasticity
● Easy setup
● High availability
4. ZooKeeper Configuration
• Membership
• Role of each server
– E.g., follower or observer
• Quorum System spec
– ZooKeeper: majority or hierarchical
• Network addresses & ports
• Timeouts, directory paths, etc.
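The membership, role and address items above can be made concrete with a small parser. This is a sketch that assumes the `server.<id>=<host>:<quorum port>:<election port>[:role];<client port>` line layout used by ZooKeeper's dynamic configuration files; the names `ServerSpec`, `parse_server_line` and `voting_members` are hypothetical, the host names are made up, and IPv6 addresses are not handled. In that layout the role is written as `participant` or `observer`, and only participants count towards a quorum.

```python
import re
from typing import NamedTuple


class ServerSpec(NamedTuple):
    server_id: int
    host: str
    quorum_port: int
    election_port: int
    role: str
    client_port: int


# Assumed line layout; the role and the client-port address are optional.
_SERVER_LINE = re.compile(
    r"^server\.(?P<id>\d+)=(?P<host>[^:]+):(?P<quorum>\d+):(?P<election>\d+)"
    r"(?::(?P<role>participant|observer))?"
    r";(?:[^:]+:)?(?P<client>\d+)$"
)


def parse_server_line(line):
    """Parse one membership line into a ServerSpec."""
    match = _SERVER_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a server line: {line!r}")
    return ServerSpec(
        server_id=int(match["id"]),
        host=match["host"],
        quorum_port=int(match["quorum"]),
        election_port=int(match["election"]),
        role=match["role"] or "participant",  # role defaults to participant
        client_port=int(match["client"]),
    )


def voting_members(specs):
    """Observers replicate data but do not vote, so they are excluded."""
    return [spec for spec in specs if spec.role == "participant"]


SAMPLE = """
server.1=zk1.example.com:2888:3888:participant;2181
server.2=zk2.example.com:2888:3888;2181
server.3=zk3.example.com:2888:3888:observer;2181
"""
specs = [parse_server_line(line) for line in SAMPLE.splitlines() if line.strip()]
voters = voting_members(specs)
```

Keeping the role in the same line as the address means a single membership change can add a server and assign its role at once.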
5. ZooKeeper - distributed and replicated
[Diagram: a ZooKeeper Service made up of five servers, one of which is the leader, serving eight clients]
• All servers store a copy of the data (in memory)
• A leader is elected at startup
• Reads served by followers, all updates go through leader
• Update acked when a quorum of servers have persisted the change (on disk)
• ZooKeeper uses ZAB - its own atomic broadcast protocol
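The acknowledgement rule above can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not ZAB itself and not ZooKeeper's code: it assumes a majority quorum, and the names `PendingUpdate`, `record_ack` and `is_committed` are hypothetical. The leader tracks which ensemble members have reported the change as persisted and treats the update as committed once a majority has done so.

```python
class PendingUpdate:
    """Tracks persistence acks for one proposed update (hypothetical sketch)."""

    def __init__(self, ensemble):
        self.ensemble = frozenset(ensemble)
        self.acks = set()

    def quorum_size(self):
        # Smallest number of servers that forms a strict majority.
        return len(self.ensemble) // 2 + 1

    def record_ack(self, server_id):
        # Acks from servers outside the ensemble are ignored;
        # a repeated ack from the same server is counted once.
        if server_id in self.ensemble:
            self.acks.add(server_id)

    def is_committed(self):
        return len(self.acks) >= self.quorum_size()


update = PendingUpdate(ensemble=[1, 2, 3, 4, 5])
update.record_ack(1)                      # e.g. the leader persists its own copy
update.record_ack(2)
committed_after_two = update.is_committed()
update.record_ack(2)                      # duplicate, no effect
update.record_ack(9)                      # not a member, no effect
update.record_ack(4)
committed_after_three = update.is_committed()
```

With five servers the update commits on the third distinct ack, so two slow or failed servers do not block it.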
6. Dynamic Membership Changes
• Necessary in every long-lived system!
• Examples:
– Cloud computing: adapt to changing load, don’t pre-allocate!
– Failures: replacing failed nodes with healthy ones
– Upgrades: replacing out-of-date nodes with up-to-date ones
– Free up storage space: decreasing the number of replicas
– Moving nodes: within the network or the data center
– Increase resilience by changing the set of servers
Example: asynch. replication works as long as more than half of the servers (> #servers/2) operate.
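The majority condition in this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical and are not part of any ZooKeeper API.

```python
def majority_quorum_size(num_servers: int) -> int:
    """Smallest number of servers that is strictly more than half."""
    return num_servers // 2 + 1


def tolerated_failures(num_servers: int) -> int:
    """How many servers may fail while a majority can still operate."""
    return num_servers - majority_quorum_size(num_servers)


# Print ensemble size, quorum size, and tolerated failures.
for n in (3, 4, 5):
    print(n, majority_quorum_size(n), tolerated_failures(n))
```

Growing an ensemble from 3 to 5 servers raises the number of tolerated failures from 1 to 2, which is one reason to change membership at run time.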
12. Hazards of Manual Reconfiguration
[Figure: servers A, B, C, each configured with {A, B, C}; D and E are not yet part of the ensemble]
• Goal: add servers E and D
13. Hazards of Manual Reconfiguration
[Figure: servers A, B, C, D, E, each configured with {A, B, C, D, E}]
• Goal: add servers E and D
• Change Configuration
17. Hazards of Manual Reconfiguration
[Figure: servers A, B, C, D, E, each configured with {A, B, C, D, E}]
• Goal: add servers E and D
• Change Configuration
• Restart Servers
• Lost and !
18. Just use a coordination service!
• ZooKeeper is the coordination service
– Don’t want to deploy another system to coordinate it!
• Who will reconfigure that system?
– GFS has 3 levels of coordination services
• More system components -> more management overhead
• Use ZooKeeper to reconfigure itself!
– Other systems store configuration information in ZooKeeper
– Can we do the same?
– Only if there are no failures
23. This doesn’t work for reconfigurations!
[Figure: the servers hold configuration {A, B, C, D, E}; a client issues setData(/zookeeper/config, {A, B, F}) to remove C, D, E and add F]
25. This doesn’t work for reconfigurations!
[Figure: after setData(/zookeeper/config, {A, B, F}) (remove C, D, E; add F), two servers hold {A, B, F} while the others still hold {A, B, C, D, E}]
• Must persist the decision to reconfigure in the old config before activating the new config!
• Once such a decision is reached, must not allow further ops to be committed in the old config
26. Our Solution
• Correct
• Fully automatic
• No external services or additional components
• Minimal changes to ZooKeeper
• Usually unnoticeable to clients
– Pause operations only in rare circumstances
– Clients work with a single configuration
• Rebalances clients across servers in new configuration
• Reconfigures immediately
• Speculative Reconfiguration
– Reconfiguration (and commands that follow it) speculatively sent out by the primary, similarly to all other updates
27. Principles
● Commit reconfig in a quorum of the old ensemble
– Submit reconfig op just like any other update
● Make sure new ensemble has latest state before becoming active
– Get quorum of synced followers from new config
– Get acks from both old and new ensembles before committing updates proposed between reconfig op and activation
– Activate new configuration when reconfig commits
● Once new ensemble is active, old ensemble cannot commit or propose new updates
● Gossip activation through leader election and syncing
● Verify configuration id of leader and follower
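The commit rule in these principles can be sketched as a predicate over the set of servers that have acknowledged the reconfig op. The sketch assumes majority quorums and uses hypothetical function names; the real leader logic in ZooKeeper is more involved.

```python
def has_majority(acks, ensemble):
    """True if the acking servers include a strict majority of the ensemble."""
    return len(set(acks) & set(ensemble)) > len(ensemble) / 2


def reconfig_can_commit(acks, old_ensemble, new_ensemble):
    """A reconfig op needs a quorum of the old AND of the new ensemble."""
    return has_majority(acks, old_ensemble) and has_majority(acks, new_ensemble)


old = {"A", "B", "C"}
new = {"A", "B", "C", "D", "E"}
print(reconfig_can_commit({"A", "B"}, old, new))       # old quorum only
print(reconfig_can_commit({"A", "B", "D"}, old, new))  # quorum of both
print(reconfig_can_commit({"A", "D", "E"}, old, new))  # new quorum only
```

Only the second call satisfies both ensembles, so it is the only one where the reconfig op (and any update proposed after it) may commit.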
29. Reconfiguration scenario 1
[Figure: servers A, B, C, D, E, each labeled with configuration {A, B, C}]
• Goal: add servers E and D
30. Reconfiguration scenario 1
[Figure: servers A, B, C, D, E; labels {A, B, C} mark the current configuration]
• Goal: add servers E and D
• Reconfig doesn't commit until quorums of both ensembles ack
31. Reconfiguration scenario 1
[Figure: servers A, B, C, D, E, each labeled with configuration {A, B, C}]
• Goal: add servers E and D
• Reconfig doesn't commit until quorums of both ensembles ack
33. Reconfiguration scenario 1
[Figure: four servers now hold {A, B, C, D, E}; one server still holds {A, B, C}]
• Goal: add servers E and D
• Reconfig doesn't commit until quorums of both ensembles ack
34. Reconfiguration scenario 1
[Figure: four servers hold {A, B, C, D, E}; C still holds {A, B, C}]
• Goal: add servers E and D
• Reconfig doesn't commit until quorums of both ensembles ack
• E and D gossip new configuration to C
35. Reconfiguration scenario 1
[Figure: servers A, B, C, D, E, each holding {A, B, C, D, E}]
• Goal: add servers E and D
• Reconfig doesn't commit until quorums of both ensembles ack
• E and D gossip new configuration to C
36. Example - reconfig using CLI
reconfig -add 1=host1.com:1234:1235:observer;1239 -add 2=host2.com:1236:1237:follower;1231 -remove 5
● Change follower 1 to an observer and change its ports
● Add follower 2 to the ensemble
● Remove follower 5 from the ensemble
reconfig -file myNewConfig.txt -v 234547
● Change the current config to the one in myNewConfig.txt
● But only if current config version is 234547
getConfig -w -c
● Set a watch on /zookeeper/config
● -c means we only want the new connection string for clients
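To make the shape of the -add arguments concrete, here is a small parser for specs such as 1=host1.com:1234:1235:observer;1239. It assumes the two ports after the host are the quorum port and the leader-election port and that the number after the semicolon is the client port, as in ZooKeeper's dynamic configuration format. ServerSpec and parse_server_spec are hypothetical names, and the sketch handles only a bare client port.

```python
from typing import NamedTuple, Optional


class ServerSpec(NamedTuple):
    server_id: int
    host: str
    quorum_port: int      # assumed meaning of the first port
    election_port: int    # assumed meaning of the second port
    role: Optional[str]   # None when the spec omits the role
    client_port: int


def parse_server_spec(spec: str) -> ServerSpec:
    """Parse '<id>=<host>:<port1>:<port2>[:<role>];<client port>'."""
    server_id, rest = spec.split("=", 1)
    peer_part, client_port = rest.split(";", 1)
    fields = peer_part.split(":")
    role = fields[3] if len(fields) > 3 else None
    return ServerSpec(int(server_id), fields[0], int(fields[1]),
                      int(fields[2]), role, int(client_port))


print(parse_server_spec("1=host1.com:1234:1235:observer;1239"))
print(parse_server_spec("2=host2.com:1236:1237:follower;1231"))
```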
37. When it will not work
● Quorum of new ensemble must be in sync
● Another reconfig in progress
● Version condition check fails
38. How do you know you are done
● Write something somewhere
39. The “client side” of reconfiguration
• When system changes, clients need to stay connected
– The usual solution: directory service (e.g., DNS)
• Re-balancing load during reconfiguration is also important!
• Goal: uniform #clients per server with minimal client migration
– Migration should be proportional to change in membership
[Figure: three servers with 10 clients each]
41. Our approach - Probabilistic Load Balancing
• Example 1:
[Figure: three servers with 10 clients each]
47. Our approach - Probabilistic Load Balancing
• Example 1:
[Figure: five servers with 6 clients each]
– Each client moves to a random new server with probability 0.4
– 1 – 3/5 = 0.4
– Exp. 40% of clients will move off of each server
● Example 2:
[Figure: three servers with 10 clients each; arrows into them are labeled 4/18, 4/18 and 10/18]
– Connected clients don’t move
– Disconnected clients move to each old server with prob. 4/18 and to the new one with prob. 10/18
– Exp. 8 clients will move from A, B, C to D, E and 10 to F
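Both examples can be checked numerically. The sketch below assumes Example 1 starts with A, B, C at 10 clients each and adds D and E, with moving clients choosing D or E uniformly, and that Example 2 starts with A to E at 6 clients each, removes A, B, C and adds F. The function and variable names are hypothetical.

```python
from fractions import Fraction


def expected_loads(load, move_prob, new_servers):
    """Expected clients per server after one round of probabilistic moves.

    load[j] is the current number of clients on server j.
    move_prob[j][i] is the probability that a client on j reconnects to i.
    """
    result = {}
    for i in new_servers:
        leave = sum(move_prob.get(i, {}).values())
        stay = load.get(i, 0) * (1 - leave)
        arrive = sum(load[j] * probs.get(i, 0)
                     for j, probs in move_prob.items() if j != i)
        result[i] = stay + arrive
    return result


# Example 1: each client leaves with probability 0.4, split over D and E.
ex1 = expected_loads(
    {s: 10 for s in "ABC"},
    {s: {"D": Fraction(1, 5), "E": Fraction(1, 5)} for s in "ABC"},
    "ABCDE",
)

# Example 2: clients of the removed servers A, B, C always reconnect.
ex2 = expected_loads(
    {s: 6 for s in "ABCDE"},
    {s: {"D": Fraction(4, 18), "E": Fraction(4, 18), "F": Fraction(10, 18)}
     for s in "ABC"},
    "DEF",
)
print(ex1)
print(ex2)
```

Under these assumptions every server ends up with an expected 6 clients in Example 1 and 10 clients in Example 2, matching the slides.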
49. Probabilistic Load Balancing
When moving from config. S to S’:
$$E(\mathrm{load}(i, S')) = \mathrm{load}(i, S) + \sum_{j \in S \wedge j \neq i} \mathrm{load}(j, S) \cdot \Pr(j \to i) - \mathrm{load}(i, S) \sum_{j \in S' \wedge j \neq i} \Pr(i \to j)$$
where
– E(load(i, S’)) is the expected #clients connected to i in S’ (10 in last example)
– load(i, S) is the #clients connected to i in S
– the first sum is the #clients moving to i from other servers in S
– the last term is the #clients moving from i to other servers in S’
Solving for Pr we get case-specific probabilities.
Input: each client answers locally
Question 1: Are there more servers now or fewer?
Question 2: Is my server being removed?
Output: 1) disconnect or stay connected to my server
2) if disconnect: Pr(connect to one of the old servers) and Pr(connect to newly added server)
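As a worked instance of the expected-load equation, Example 1 can be solved by hand. The sketch assumes S = {A, B, C} with 10 clients each, S' = {A, B, C, D, E} with a target of 6 clients each, that no client moves between old servers, and that every old server sends clients to each new server with the same probability p. These are illustrative assumptions rather than statements from the slides.

```latex
\begin{align*}
\text{old server } i \in S:\quad
  6 &= 10 + 0 - 10 \sum_{j \in S' \wedge j \neq i} \Pr(i \to j)
  &&\Rightarrow\quad \sum_{j \in S' \wedge j \neq i} \Pr(i \to j) = 0.4 = 1 - \tfrac{3}{5} \\
\text{new server } i \in S' \setminus S:\quad
  6 &= 0 + \sum_{j \in S} 10 \cdot p - 0
  &&\Rightarrow\quad p = \tfrac{6}{30} = 0.2
\end{align*}
```

Each old server therefore keeps a client with probability 0.6 and hands it to D or to E with probability 0.2 each, which agrees with the 0.4 move probability of Example 1.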
50. Implementation
• Implemented in ZooKeeper (Java & C), integration ongoing
– 3 new ZooKeeper API calls: reconfig, getConfig, updateServerList
– Feature requested since 2008, expected in 3.5.0 release (July 2012)
• Dynamic changes to:
– Membership
– Quorum System
– Server roles
– Addresses & ports
• Reconfiguration modes:
– Incremental (add servers E and D, remove server B)
– Non-incremental (new config = {A, C, D, E})
– Blind or conditioned (reconfig only if current config is #5)
• Subscriptions to config changes
– Client can invoke client-side re-balancing upon change
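The reconfiguration modes can be illustrated on plain sets of server names. This is a toy model under stated assumptions: the current membership is taken to be {A, B, C}, the config version is a plain integer, and reconfigure and VersionMismatch are hypothetical names that do not reproduce the signature of ZooKeeper's reconfig call.

```python
class VersionMismatch(Exception):
    """Raised when a conditioned reconfig sees an unexpected config version."""


def reconfigure(current, version, *, joining=(), leaving=(),
                new_members=None, expected_version=None):
    """Return the new membership set for one reconfiguration request."""
    # Conditioned mode: refuse unless the current version matches.
    if expected_version is not None and expected_version != version:
        raise VersionMismatch(f"expected {expected_version}, found {version}")
    # Non-incremental mode: the caller supplies the full new membership.
    if new_members is not None:
        return set(new_members)
    # Incremental mode: apply additions and removals to the current set.
    return (set(current) - set(leaving)) | set(joining)


current = {"A", "B", "C"}
incremental = reconfigure(current, 5, joining={"D", "E"}, leaving={"B"})
non_incremental = reconfigure(current, 5, new_members={"A", "C", "D", "E"})
conditioned = reconfigure(current, 5, joining={"D"}, expected_version=5)
print(sorted(incremental), sorted(non_incremental), sorted(conditioned))
```

In this model the incremental and non-incremental requests from the slide produce the same membership, and a blind request is simply one that omits expected_version.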
51. Summary
• Design and implementation of reconfiguration for Apache ZooKeeper
– being contributed into ZooKeeper codebase
• Much simpler than state of the art, using properties already provided by ZooKeeper
• Many nice features:
– Doesn’t limit concurrency
– Reconfigures immediately
– Preserves primary order
– Doesn’t stop client ops
– ZooKeeper is used by online systems; any delay must be avoided
– Clients work with a single configuration at a time
– No external services
– Includes client-side rebalancing