This document discusses lessons learned from building a scalable, self-serve, real-time, multi-tenant monitoring service at Yahoo. It describes transitioning from a classical architecture to one based on real-time big data technologies like Storm and Kafka. Key lessons include properly handling producer-consumer problems at scale, challenges of debugging skewed data, strategically managing multi-tenancy and resources, issues optimizing asynchronous systems, and not neglecting assumptions outside the application.
The NameNode was experiencing high load and instability after being restarted. Graphs showed unexplained spikes in load between checkpoints on the NameNode. DataNode logs showed repeated 60,000 ms timeouts when communicating with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
Tl;dr: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and from optimising dozens of Spark jobs, including the most spectacular: from 2500 GB of RAM down to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, the memory allocation is increased. If processing is too slow, the number of executors is increased. This works for a while, but sooner or later you end up with a fully utilized cluster running inefficiently.
During the presentation, we will share our lessons learned and the performance improvements we made to Spark jobs, including the most spectacular: from 2500 GB of RAM down to 240. We will also answer questions like:
- How do PySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions?
and many more.
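On the mapPartitions point, the usual argument is that per-record setup costs (opening a connection, loading a model) are paid once per partition instead of once per record. A minimal sketch in plain Python (not Spark's actual API; the cost constants are illustrative assumptions):

```python
# Conceptual sketch of why mapPartitions can beat map: a fixed setup cost
# is paid once per partition rather than once per record.
SETUP_COST = 100   # e.g. opening a DB connection or loading a model
PER_RECORD = 1     # cost to process one record

def cost_with_map(partitions):
    # map-style: setup repeated for every single record
    return sum(SETUP_COST + PER_RECORD for part in partitions for _ in part)

def cost_with_map_partitions(partitions):
    # mapPartitions-style: setup once per partition, then stream the records
    return sum(SETUP_COST + PER_RECORD * len(part) for part in partitions)

partitions = [list(range(1000)) for _ in range(8)]  # 8 partitions x 1000 rows
print(cost_with_map(partitions))             # 808000
print(cost_with_map_partitions(partitions))  # 8800
```

The two orders of magnitude between the totals are exactly the kind of gap that shows up when heavyweight initialization is accidentally placed inside a per-record map.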
Unified Batch & Stream Processing with Apache Samza (DataWorks Summit)
The traditional lambda architecture has been a popular solution for joining offline batch operations with real-time operations. This setup incurs significant developer and operational overhead, since it involves maintaining code that must produce the same result in two potentially different distributed systems. To alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experience running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) pluggable data sources and sinks; b) a deployment model supporting different execution environments such as YARN or VMs; c) a unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of the lambda architecture. We will use real production use cases to illustrate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
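The core idea of a unified processing API can be sketched in a few lines of plain Python (this is not Samza's API; the source/pipeline names are illustrative): the same transformation logic runs unchanged over a bounded batch source or an unbounded stream source.

```python
# Sketch of a unified processing layer: one pipeline definition,
# pluggable bounded (batch) and unbounded (stream) sources.
def batch_source():
    yield from [1, 2, 3]        # bounded: e.g. files in HDFS

def stream_source():
    yield from iter([4, 5, 6])  # unbounded in reality; finite here for the sketch

def pipeline(source):
    """The same transformation logic, regardless of source type."""
    return [x * 10 for x in source()]

print(pipeline(batch_source))   # [10, 20, 30]
print(pipeline(stream_source))  # [40, 50, 60]
```

In a lambda architecture, `pipeline` would exist twice, once per system; the unified framework keeps a single definition and swaps only the source.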
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... (Databricks)
This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash to improve Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
Jump Start on Apache Spark 2.2 with Databricks (Anyscale)
Apache Spark 2.0 and the subsequent Spark 2.1 and 2.2 releases have laid the foundation for many new features and much new functionality. Its three main themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, its query processing logic, and competitive information.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) (Sudhir Mallem)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
using storage formats: Parquet, ORC, RCFile and Avro
Compression: Snappy, zlib and default compression (gzip)
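The compression trade-off the POC compares can be sanity-checked with Python's standard library (Snappy is omitted here since it is not in the stdlib; the sample data is an illustrative assumption). gzip and zlib share the same deflate algorithm, so at the same level they differ only by container overhead:

```python
# Compare zlib vs gzip output sizes on repetitive, columnar-style data.
import gzip
import zlib

data = b"row_value,row_value,row_value\n" * 10_000

zlib_out = zlib.compress(data, level=6)
gzip_out = gzip.compress(data, compresslevel=6)  # gzip = deflate + header/CRC trailer

print(len(data), len(zlib_out), len(gzip_out))
# gzip output is slightly larger than raw zlib because of its bigger
# header and checksum; the deflate payload itself is the same.
```

In practice, CPU cost matters as much as size, which is why fast-but-lighter codecs such as Snappy are also in the comparison.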
How to Use Parquet as a Basis for ETL and Analytics (DataWorks Summit)
Parquet is a columnar storage format that provides efficient compression and querying capabilities. It aims to store data efficiently for analysis while supporting interoperability across systems. Parquet uses column-oriented storage with efficient encodings and statistics to enable fast querying of large datasets. It integrates with many query engines and frameworks like Hive, Impala, Spark and MapReduce to allow projection and predicate pushdown for optimized queries.
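The two optimizations named above, projection pushdown (read only needed columns) and predicate pushdown (skip chunks using min/max statistics), can be sketched in plain Python. This is a conceptual model, not the Parquet format itself; the column names and chunking are illustrative:

```python
# Column-oriented layout: each column stored separately, split into chunks
# with min/max stats, loosely mimicking Parquet row groups.
columns = {
    "price": [[3, 7, 9], [12, 15, 18]],   # two chunks
    "qty":   [[1, 2, 3], [4, 5, 6]],
}
stats = {col: [(min(c), max(c)) for c in chunks] for col, chunks in columns.items()}

def scan(column, predicate, lo, hi):
    """Read one column only (projection), skipping chunks whose
    [min, max] range cannot satisfy the predicate (predicate pushdown)."""
    out = []
    for chunk, (cmin, cmax) in zip(columns[column], stats[column]):
        if cmax < lo or cmin > hi:
            continue  # whole chunk skipped without reading its values
        out.extend(v for v in chunk if predicate(v))
    return out

# Query: price BETWEEN 10 AND 20 -> the first chunk is skipped entirely
print(scan("price", lambda v: 10 <= v <= 20, 10, 20))  # [12, 15, 18]
```

The `qty` column is never touched at all, which is the projection half of the optimization.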
Building a Large-Scale Transactional Data Lake Using Apache Hudi (Bill Liu)
Data is a critical piece of infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency and to help distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then deep dive into how it improves data operations through features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Hoodie, an open-source incremental processing framework, is summarized. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS within a few minutes and chaining of incremental processing in Hadoop.
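The two primitives named in the summary, upsert and incremental pull, can be sketched in plain Python (this is not Hudi's actual API; the table layout and function names are illustrative assumptions):

```python
# Minimal model of a Hudi-style table: each record key carries the commit
# time of its last mutation, so consumers can pull only what changed.
table = {}          # record_key -> (commit_time, value)
commit_time = 0

def upsert(records):
    """Insert new keys and update existing ones, stamping a commit time."""
    global commit_time
    commit_time += 1
    for key, value in records.items():
        table[key] = (commit_time, value)
    return commit_time

def incremental_pull(since):
    """Return only records committed after `since` -- no full table scan."""
    return {k: v for k, (t, v) in table.items() if t > since}

upsert({"a": 1, "b": 2})
upsert({"b": 20, "c": 3})            # update b, insert c
print(incremental_pull(since=1))     # {'b': 20, 'c': 3}
```

A downstream job that remembers the last commit time it consumed can thus reprocess only mutated records, which is what makes chained incremental pipelines cheap compared to repeated full scans.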
Cisco Connect Toronto 2015: Big Data (Sean McKeown, Cisco Canada)
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika (Databricks)
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach affords more performance and energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programming interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow, and in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
Designing and Building Next Generation Data Pipelines at Scale with Structure... (Databricks)
This document discusses the evolution of data pipelines at Databricks over time from 2014 to present day. Early pipelines involved copying data from S3 hourly, which did not scale. Later pipelines used Amazon Kinesis but led to performance issues with many small files. The document then introduces structured streaming and Delta Lake as better solutions. Structured streaming provides correctness while Delta Lake improves performance, scalability, and makes data management and GDPR compliance easier through features like ACID transactions, automatic schema management, and built-in deletion/update support.
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase (HBaseCon)
As the operator of the dominant messenger application in South Korea, KakaoTalk has more than 170 million users, and our ever-growing graph has more than 10B edges and 200M vertices. This scale presents several technical challenges for storing and querying the graph data, but we have resolved them by creating a new distributed graph database with HBase. Here you'll learn the methodology and architecture we used to solve the problems, compare it with another well-known graph database, Titan, and explore the HBase issues we encountered.
A sharing in a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... (Cloudera, Inc.)
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.
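The Pareto-optimality criterion the framework searches for can be sketched in plain Python (the actual work uses a multi-objective genetic algorithm over emulated container runs; the configurations and their time/cost numbers below are made up for illustration):

```python
# A configuration is Pareto-optimal if no other configuration is at least
# as good on both execution time and deployment cost and strictly better
# on at least one of them.
configs = [
    {"name": "small",  "time": 120, "cost": 10},
    {"name": "medium", "time": 80,  "cost": 18},
    {"name": "large",  "time": 60,  "cost": 35},
    {"name": "waste",  "time": 90,  "cost": 40},  # dominated by "medium"
]

def dominates(a, b):
    return (a["time"] <= b["time"] and a["cost"] <= b["cost"]
            and (a["time"] < b["time"] or a["cost"] < b["cost"]))

pareto = [c for c in configs if not any(dominates(o, c) for o in configs)]
print([c["name"] for c in pareto])  # ['small', 'medium', 'large']
```

The genetic algorithm's job is to explore the configuration space efficiently; the result it hands back is a front like `pareto` above, from which operators pick a time/cost trade-off.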
Python in the Hadoop Ecosystem (Rock Health presentation) (Uri Laserson)
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Foundations of streaming SQL: stream & table theory (DataWorks Summit)
The document provides an overview of streaming SQL and time-varying relations. It discusses:
1) How relations evolve over time in streaming SQL, with data divided into time intervals. This allows querying the relation at any point in time.
2) The closure properties of relational algebra still apply to time-varying relations. Operations like filtering and grouping can be performed on intervals of the relation.
3) Streaming SQL extends classic SQL to handle continuous queries over streaming data, represented as time-varying relations divided into time-based intervals.
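The idea of a time-varying relation can be made concrete with a small plain-Python sketch (the event data and helper names are illustrative, not from the talk): the relation is a function of time, and an ordinary relational query can be evaluated against its contents at any chosen point.

```python
# A time-varying relation: an append-only stream of timestamped events,
# snapshotted at a point in time and then queried with classic semantics.
events = [  # (event_time, user, score)
    (1, "amy", 3),
    (2, "bob", 4),
    (4, "amy", 5),
]

def relation_at(t):
    """The relation's contents as of time t."""
    return [(user, score) for (et, user, score) in events if et <= t]

def total_by_user(rows):
    """An ordinary grouped aggregation, applied to one snapshot."""
    totals = {}
    for user, score in rows:
        totals[user] = totals.get(user, 0) + score
    return totals

print(total_by_user(relation_at(2)))  # {'amy': 3, 'bob': 4}
print(total_by_user(relation_at(4)))  # {'amy': 8, 'bob': 4}
```

A streaming SQL engine effectively keeps re-evaluating the same query as the relation advances through these time points, rather than querying a single frozen table.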
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop (DataWorks Summit)
Hadoop Eagle is a full-stack realtime monitoring framework for eBay's Hadoop clusters. It uses task failure ratios to detect node anomalies, and monitors jobs, performance, and metrics across clusters in real-time. The framework addresses challenges of monitoring eBay's large Hadoop environment, which includes 10+ clusters, 10,000+ data nodes, and processing of 50 million+ tasks per day.
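The task-failure-ratio heuristic can be sketched in plain Python (the summary does not specify Eagle's exact algorithm, so the threshold rule and the node counts below are illustrative assumptions):

```python
# Flag nodes whose task failure ratio is far above the cluster-wide rate.
node_tasks = {  # node -> (failed_tasks, total_tasks)
    "node-a": (2, 1000),
    "node-b": (3, 1200),
    "node-c": (150, 1100),  # suspicious outlier
}

cluster_failed = sum(f for f, _ in node_tasks.values())
cluster_total = sum(t for _, t in node_tasks.values())
cluster_ratio = cluster_failed / cluster_total

def anomalous(threshold_factor=2.0):
    """Nodes whose failure ratio exceeds threshold_factor x cluster rate."""
    return [n for n, (f, t) in node_tasks.items()
            if (f / t) > threshold_factor * cluster_ratio]

print(anomalous())  # ['node-c']
```

At eBay's scale (50M+ tasks per day), even this simple ratio comparison has to run as a streaming aggregation rather than a batch query, which is what the "real-time" in the framework's name refers to.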
This talk takes you on a rollercoaster ride through Hadoop 2 and explains the most significant changes and components.
The talk was held at the JavaLand conference in Brühl, Germany, on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... (DataStax)
At Knewton we operate a total of 29 clusters across five different VPCs, each cluster ranging from 3 to 24 nodes. Maintaining this with a team of three is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements provided to Hive by the Stinger initiative.
LLAP enables sub-second analytical queries in Hive by running query fragments directly in memory on compute nodes using a long-running daemon process. It provides high performance scans and execution through an in-memory columnar cache shared across queries. LLAP queries are coordinated independently by Tez while utilizing Hive operators for processing and Tez for data transfers. It improves upon traditional MapReduce and Tez by keeping intermediate query results in memory rather than writing to disk.
C19013010: The Tutorial to Build Shared AI Services, Session 2 (Bill Liu)
This document provides an agenda and overview for a tutorial on building shared AI services. The session will cover AI engineering platforms, data pipelines, traditional AI roles and their challenges, skills required for AI engineers, and benchmarking machine learning and deep learning approaches. It includes a live demo of building an end-to-end AI pipeline with Kafka, NiFi, Spark Streaming and Keras on Spark.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset (Hosted by Confluent)
Streaming data systems have been growing rapidly in importance in the modern data stack. Kafka's ksqlDB provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases, including anomaly detection, operational monitoring, and online data integration.
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
An Open Source Incremental Processing Framework called Hoodie is summarized. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of few minutes and chaining of incremental processing in hadoop
Cisco connect toronto 2015 big data sean mc keownCisco Canada
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaDatabricks
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy, compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in an improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach affords more performance/energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing for the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programmable interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow, and in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
This document discusses the evolution of data pipelines at Databricks over time from 2014 to present day. Early pipelines involved copying data from S3 hourly, which did not scale. Later pipelines used Amazon Kinesis but led to performance issues with many small files. The document then introduces structured streaming and Delta Lake as better solutions. Structured streaming provides correctness while Delta Lake improves performance, scalability, and makes data management and GDPR compliance easier through features like ACID transactions, automatic schema management, and built-in deletion/update support.
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
As the operator of the dominant messenger application in South Korea, KakaoTalk has more than 170 million users, and our ever-growing graph has more than 10B edges and 200M vertices. This scale presents several technical challenges for storing and querying the graph data, but we have resolved them by creating a new distributed graph database with HBase. Here you'll learn the methodology and architecture we used to solve the problems, compare it another famous graph database, Titan, and explore the HBase issues we encountered.
A sharing in a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Foundations of streaming SQL: stream & table theoryDataWorks Summit
The document provides an overview of streaming SQL and time-varying relations. It discusses:
1) How relations evolve over time in streaming SQL, with data divided into time intervals. This allows querying the relation at any point in time.
2) The closure properties of relational algebra still apply to time-varying relations. Operations like filtering and grouping can be performed on intervals of the relation.
3) Streaming SQL extends classic SQL to handle continuous queries over streaming data, represented as time-varying relations divided into time-based intervals.
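The snapshot idea behind time-varying relations can be sketched in a few lines of Python; the row layout and names here are illustrative, not anything from the talk:

```python
from dataclasses import dataclass

# A time-varying relation: each row carries a validity interval
# [start, end). Querying "the relation at time t" selects the rows
# whose interval contains t -- ordinary relational operators then
# apply unchanged (the closure property mentioned above).
@dataclass
class VersionedRow:
    start: int   # event time the row became valid
    end: int     # event time it stopped being valid (exclusive)
    key: str
    value: int

def relation_at(rows, t):
    """Snapshot of the time-varying relation at event time t."""
    return [r for r in rows if r.start <= t < r.end]

def group_sum(snapshot):
    """Classic relational grouping applied to a snapshot."""
    out = {}
    for r in snapshot:
        out[r.key] = out.get(r.key, 0) + r.value
    return out

rows = [
    VersionedRow(0, 10, "a", 1),
    VersionedRow(5, 10, "a", 2),   # a second 'a' row appears at t=5
    VersionedRow(0, 5,  "b", 7),   # the 'b' row is retracted at t=5
]

print(group_sum(relation_at(rows, 2)))  # {'a': 1, 'b': 7}
print(group_sum(relation_at(rows, 7)))  # {'a': 3}
```

A continuous query is then just this snapshot-plus-aggregate evaluated at every interval boundary instead of once.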
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
Hadoop Eagle is a full-stack realtime monitoring framework for eBay's Hadoop clusters. It uses task failure ratios to detect node anomalies, and monitors jobs, performance, and metrics across clusters in real-time. The framework addresses challenges of monitoring eBay's large Hadoop environment, which includes 10+ clusters, 10,000+ data nodes, and processing of 50 million+ tasks per day.
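A minimal sketch of the failure-ratio heuristic, with hypothetical node names and a made-up threshold (the abstract does not specify Eagle's actual detection logic):

```python
def failure_ratios(task_counts):
    """task_counts: node -> (failed, total). Returns node -> failure ratio."""
    return {node: failed / total
            for node, (failed, total) in task_counts.items() if total > 0}

def anomalous_nodes(task_counts, threshold=0.5):
    """Flag nodes whose task failure ratio exceeds the threshold --
    a simplified stand-in for using failure ratios to spot bad nodes."""
    return sorted(node for node, ratio in failure_ratios(task_counts).items()
                  if ratio > threshold)

counts = {
    "node-01": (2, 100),    # 2% failures: healthy
    "node-02": (80, 100),   # 80% failures: likely a bad disk or NIC
    "node-03": (0, 50),
}
print(anomalous_nodes(counts))  # ['node-02']
```

At eBay's scale (50 million+ tasks per day), the interesting engineering is computing these ratios incrementally over a stream rather than batch-scanning counters.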
This talk takes you on a rollercoaster ride through Hadoop 2 and explains the most significant changes and components.
The talk has been held on the JavaLand conference in Brühl, Germany on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. Maintaining this with a team of three is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Presentation given for the SQLPass community at SQLBits XIV in Londen. The presentation is an overview about the performance improvements provided to Hive with the Stinger initiative.
LLAP enables sub-second analytical queries in Hive by running query fragments directly in memory on compute nodes using a long-running daemon process. It provides high performance scans and execution through an in-memory columnar cache shared across queries. LLAP queries are coordinated independently by Tez while utilizing Hive operators for processing and Tez for data transfers. It improves upon traditional MapReduce and Tez by keeping intermediate query results in memory rather than writing to disk.
Similar to Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo
C19013010 the tutorial to build shared ai services session 2Bill Liu
This document provides an agenda and overview for a tutorial on building shared AI services. The session will cover AI engineering platforms, data pipelines, traditional AI roles and their challenges, skills required for AI engineers, and benchmarking machine learning and deep learning approaches. It includes a live demo of building an end-to-end AI pipeline with Kafka, NiFi, Spark Streaming and Keras on Spark.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
Streaming data systems have been growing rapidly in importance to the modern data stack. Kafka’s kSQL provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools such as Koalas help data scientists to do exploratory data analysis at scale in a language and framework they are familiar with as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Real-Time Analytics With StarRocks (DWH+DL).pdfAlbert Wong
StarRocks, an open-source distributed columnar engine for real-time analytics, is carving its niche in the big data landscape. Its ability to handle high-velocity data streams and deliver blazing-fast query responses makes it a compelling choice for modern analytics workloads. Let's delve into the intricacies of real-time analytics with StarRocks and explore its capabilities.
Data Ingestion:
The journey begins with ingesting data into StarRocks. It supports a variety of real-time data sources, including Kafka, Pulsar, and custom streaming protocols. These integrations allow seamless data flow from streaming sources to StarRocks, ensuring minimal latency.
Stream Processing and Storage:
StarRocks employs a hybrid architecture for real-time processing. Incoming data streams are first processed by lightweight stream engines like Flink or Spark. These engines perform initial aggregations and transformations, preparing the data for efficient storage in StarRocks' columnar format. This format facilitates rapid data retrieval and filtering, crucial for real-time querying.
Real-time Querying:
The true power of StarRocks lies in its real-time query engine. Once data lands in StarRocks, users can leverage SQL-like queries to analyze it with minimal lag. StarRocks optimizes queries by exploiting its columnar storage and parallel processing capabilities. This enables sub-second response times for even complex queries, empowering users to gain immediate insights from their data.
Advanced Features:
StarRocks packs several features that further enhance its real-time analytics prowess. Materialized views act as pre-computed summaries of data, accelerating frequently used queries. Additionally, StarRocks' automatic tiered storage seamlessly migrates less frequently accessed data to cost-effective storage solutions, optimizing resource utilization.
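Why a materialized view helps can be shown with a toy sketch: the aggregation cost is paid incrementally at ingest time, so reads become constant-time lookups instead of full scans. This is plain Python, not StarRocks syntax:

```python
class MaterializedSum:
    """Toy materialized view: maintains a running SUM(value) GROUP BY key,
    updated incrementally as rows arrive, so a frequent query is an O(1)
    dictionary lookup rather than a scan over the base table."""
    def __init__(self):
        self.totals = {}

    def ingest(self, key, value):
        # The view is refreshed as part of the write path.
        self.totals[key] = self.totals.get(key, 0) + value

    def query(self, key):
        return self.totals.get(key, 0)

view = MaterializedSum()
for key, value in [("us", 3), ("eu", 5), ("us", 4)]:
    view.ingest(key, value)
print(view.query("us"))  # 7
```

Real engines generalize this to arbitrary aggregates and rewrite incoming queries to hit the view automatically; the cost model is the same trade of write-time work for read-time speed.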
Powering Interactive BI Analytics with Presto and Delta LakeDatabricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Speeding Time to Insight with a Modern ELT ApproachDatabricks
The availability of new tools in the modern data stack is changing the way data teams operate. Specifically, the modern data stack supports an “ELT” approach for managing data, rather than the traditional “ETL” approach. In an ELT approach, data sources are automatically loaded in a normalized state into Delta Lake and opinionated transformations happen in the data destination using dbt. This workflow allows data analysts to move more quickly from raw data to insight, while creating repeatable data pipelines robust to changes in the source datasets. In this presentation, we’ll illustrate how easy it is for even a data analytics team of one to develop an end-to-end data pipeline. We’ll load data from GitHub into Delta Lake, then use pre-built dbt models to feed a daily Redash dashboard on sales performance by manager, and use the same transformed models to power the data science team’s predictions of future sales by segment.
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both Advanced Analytics and traditional Business Intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we’ve built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying to support the above use cases. We will specifically discuss how we ingest and analyze IoT (vehicle telematics data) in real-time and batch, combine data from multiple data sources into a single data model, and support standardized and ad-hoc reporting requirements.
About the Speaker
Jim Peregord Vice President - Analytics, Business Intelligence, Data Management, Element Corp.
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
Discover how to avoid common pitfalls when shifting to an event-driven architecture (EDA) in order to boost system recovery and scalability. We cover Kafka Schema Registry, in-broker transformations, event sourcing, and more.
Architecting next generation big data platformhadooparchbook
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh."
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It begins by introducing the speaker and GigaSpaces. It then discusses how modern applications require both online transaction processing and real-time operational intelligence. The document presents examples from retail and IoT and the goals of minimizing latency while maximizing data analytics locality. It provides an overview of in-memory computing options and describes how GigaSpaces uses an in-memory data grid combined with Spark to achieve HTAP. The document includes deployment diagrams and discusses data grid RDDs and pushing predicates to the data grid. It describes how this was productized as InsightEdge and provides additional innovations and reference architectures.
Architecting a Next Generation Data Platformhadooparchbook
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high level architecture including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka streams and storage in Hadoop.
During this webinar, we will review best practices and lessons learned from working with large and mid-size companies on their deployment of PostgreSQL. We will explore the practices that helped industry leaders move through these stages quickly, and get as much value out of PostgreSQL as possible without incurring undue risk.
We have identified a set of levers that companies can use to accelerate their success with PostgreSQL:
- Application Tiering
- Collaboration between DBAs and Development Teams
- Evangelizing
- Standardization and Automation
- Balance of Migration and New Development
Architecting a next-generation data platformhadooparchbook
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
This document discusses Hadoop at Yahoo, including:
- Yahoo has built a large multi-tenant Apache Hadoop deployment that powers many of its businesses and use cases.
- Over the years, Yahoo has scaled its Hadoop infrastructure significantly, now consisting of over 50,000 servers and 50PB of storage.
- Yahoo uses Hadoop for a wide range of use cases across advertising, search, personalization, anti-spam, and more, processing data at massive scales of billions of records daily.
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Sumeet Singh
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm got pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives increased flexibility to our data pipelines teams in maintaining SLAs. Focus on ease of use and onboarding improvements have brought in a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
The Yahoo! Hadoop grid makes use of a managed service to get data pulled into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and underlying storage format of data for the users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplished the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about the files, partitions and their location. We will also demo the data-out capabilities, and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Operating multi-tenant clusters requires careful planning of capacity for on-time launch of big data projects and applications within expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operate big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques and methodology applied on a per-project or user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share considerations that got incorporated to come up with the most appropriate calculation across these three primary deployments. We will discuss the data sources for calculations, resource drivers for different use cases, and how to plan for optimum capacity allocation per project with respect to given standard hardware configurations.
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
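The unit-cost derivation described above reduces to simple arithmetic once TCO and utilization are known; a sketch with entirely hypothetical figures:

```python
def unit_cost(annual_tco, capacity, utilization):
    """Derive a unit cost (e.g. $ per GB-year of storage) by spreading the
    cluster's total cost of ownership over the capacity actually consumed.
    All figures below are hypothetical, not Yahoo's real numbers."""
    effective = capacity * utilization   # capacity actually used
    return annual_tco / effective

# Hypothetical HDFS tier: $1.2M/year TCO, 10 PB raw (in GB), 60% utilized.
storage_cost = unit_cost(1_200_000, 10_000_000, 0.60)
print(f"${storage_cost:.4f} per GB-year")        # $0.2000 per GB-year

# A project storing 50 TB can then be metered and billed its share:
project_gb = 50_000
print(f"${project_gb * storage_cost:,.0f} per year")  # $10,000 per year
```

Note how the utilization term matters: billing against raw rather than effective capacity understates unit cost and hides the incentive to raise utilization.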
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach leverages ever-improving Hive performance to open up easy ad-hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Sumeet Singh
Hadoop has allowed us to move towards a unified source of truth for all of organization’s data. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs will become critical with increasing scale of operations.
In this talk, we will share an approach in tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach allows us to open up easy ad-hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop’s scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges that come from scale, security and multi-tenancy we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo’s Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
In this talk, we look at YARN scheduler choices available today for Apache Hadoop 2 and discuss their pros and cons. We dive deeper into Capacity Scheduler by providing a comprehensive overview of its various settings with examples from real large-scale Hadoop clusters to promote a broader understanding of schedulers’ current state and best practices in place today when it comes to queue nomenclature, planning, allocations, and ongoing management. We present detailed cluster, queue, and job behaviors from several different capacity management philosophies.
We then propose practical solutions without any change to the scheduler or core Hadoop that allow managing queue creations and capacity allocations while optimizing for cluster utilization and maintaining SLA guarantees. A unified queue nomenclature, admission and capacity re-allocation policies across BUs, applications, and clusters make service automation possible. Transparency in resources consumed allows for defining realistic SLA expectations. Finally, consistent application tagging completes the feedback loop with SLAs observed through application-level reporting.
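One invariant such automation must enforce is that sibling queue capacities under each Capacity Scheduler parent sum to 100%; a minimal validator, with purely illustrative queue names:

```python
def validate_queues(tree, parent="root"):
    """Check the Capacity Scheduler invariant that sibling queue
    capacities under each parent sum to 100%. `tree` maps a parent
    queue path to {child: capacity_percent}."""
    errors = []
    children = tree.get(parent, {})
    if children and abs(sum(children.values()) - 100.0) > 1e-6:
        errors.append(f"{parent}: children sum to {sum(children.values())}")
    for child in children:
        errors.extend(validate_queues(tree, f"{parent}.{child}"))
    return errors

queues = {
    "root": {"bu1": 40.0, "bu2": 40.0, "adhoc": 20.0},
    "root.bu1": {"prod": 70.0, "dev": 30.0},
}
print(validate_queues(queues))  # [] -- allocations are consistent
```

Running a check like this before pushing scheduler config changes catches the most common queue-planning mistake before it takes a cluster's capacity model out of balance.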
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...Sumeet Singh
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo's Hadoop clusters. A key component that enables this efficient operation is data compression.
With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented.
The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
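The ratio-versus-speed tension is easy to observe with Python's standard-library codecs (gzip, bzip2, and LZMA stand in here for the cluster codecs; Snappy and LZ4 require third-party bindings):

```python
import bz2
import gzip
import lzma
import time

# Repetitive sample payload, standing in for a typical log file.
data = (b"2016-09-27 INFO org.apache.hadoop.hdfs.DataNode heartbeat ok\n") * 20_000

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    # Higher ratio generally costs more CPU time -- the core tradeoff.
    print(f"{name:>5}: ratio {len(data) / len(out):7.1f}x in {elapsed * 1000:6.1f} ms")
```

Absolute numbers depend on the machine and payload, but on text-like data bzip2 and LZMA typically win on ratio while gzip (and, outside the stdlib, Snappy/LZ4) wins on throughput, which is exactly why Hadoop lets the codec be chosen per job and per file format.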
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and Cloud Platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
Cloud-based architectures of Hadoop have made it attractive for public cloud service providers to offer hosted Hadoop services and charge customers on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises thinking of adopting Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! Sumeet Singh
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and the new use cases that have emerged as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and enhancements made to facilitate multi-tenancy. Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, while Namespaces provide logical grouping of resources (region servers, tables) and privileges (quotas, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is repaid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo
1. Lessons Learned Building A Scalable
Self-serve, Real-time, Multi-tenant
Monitoring Service
PRESENTED BY Mridul Jain, Sumeet Singh⎪ March 31, 2016
Strata Conference + Hadoop World 2016, San Jose
2. Introduction
2
§ Big ML at Yahoo
§ Has used Storm and Kafka for real-time trend
analysis in search and central monitoring
§ Co-authored Pig on Storm
§ Co-authored CaffeOnSpark for distributed deep
learning
Mridul Jain
Senior Principal Architect
Big Data and Machine Learning
Science and Technology
701 First Avenue,
Sunnyvale, CA 94089 USA
@mridul_jain
§ Manages Hadoop products team at Yahoo
§ Responsible for Product Management, Strategy and
Customer Engagements
§ Managed Cloud Services products team and headed
strategy functions for the Cloud Platform Group at
Yahoo
§ MBA from UCLA and MS from Rensselaer
Polytechnic Institute (RPI)
Sumeet Singh
Sr. Director, Product Management
Cloud and Big Data Platforms
Science and Technology
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
3. Acknowledgement
3
We want to acknowledge the contributions from Kapil Gupta and Arun Gupta,
Principal Architects with the Yahoo Monitoring team to this presentation as well
as the monitoring platform.
We would also like to thank the entire Yahoo Monitoring and Hadoop and
Big Data Platforms teams for making the next generation monitoring services
a reality at Yahoo.
4. Agenda
4
1 Overview
2 Transitioning from Classical to Real-time Big Data Architecture
3 Lessons Learned Scaling the Real-time Big Data Stack
4 Lessons Learned Optimizing for System Performance
5 Q&A
5. Introduction to Yahoo’s Monitoring as a Service
5
...
...
Infra Monitoring
CPU, disk, network
Host uptime
HTTP sess. errors
Hosts
Apps
App Monitoring
Req. per second
Avg. latency
API access errors
Hosted Multi-tenant
Monitoring
Service
Collection
Storage
Scheduling
Coordination
Alerts /
Thresholds
Dashboards
Aggregation
6. Classical Architecture – Pre Real-time Big Data Tech
6
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
7. Classical Architecture – Pre Real-time Big Data Tech
7
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
8. Classical Architecture – Pre Real-time Big Data Tech
8
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
2 Manually Sharded DBs
9. Classical Architecture – Pre Real-time Big Data Tech
9
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
2 Manually Sharded DBs
3 Massive Query Federation
10. Classical Architecture – Pre Real-time Big Data Tech
10
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
2 Manually Sharded DBs
3 Massive Query Federation
✗ Manageability Challenges
11. Classical Architecture – Pre Real-time Big Data Tech
11
H1
H2
H3
H4
H5
Collector Aggregator
Server
DB Server
Dashboard
12. Classical Architecture – Pre Real-time Big Data Tech
12
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
13. Classical Architecture – Pre Real-time Big Data Tech
13
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
14. Classical Architecture – Pre Real-time Big Data Tech
14
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
3 1 shard per cluster
1.5M DP/min
15. Classical Architecture – Pre Real-time Big Data Tech
15
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
3 1 shard per cluster
1.5M DP/min
4 Seq. fetch for federated queries
16. Classical Architecture – Pre Real-time Big Data Tech
16
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
3 1 shard per cluster
1.5M DP/min
4 Seq. fetch for federated queries
✗ Scale Challenges ✗ Availability Challenges
17. Architecture Based on Real-time Big Data Tech
17
Hosts Collectors Data
Highway
UI
Dashboard
&
Graphs
18. Architecture Based on Real-time Big Data Tech
18
Hosts Collectors Data
Highway
UI
Dashboard
&
Graphs
No manual partitioning / sharding
Built-in horizontal scalability
Built-in High-availability
✔ Manageability
✔ Scalability
✔ Availability
Standard Big Data Frameworks
24. Lessons Learned
24
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
25. Lessons Learned
25
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
26. Storm + Kafka Based Architecture
26
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Product N
133 topics
Storm
Kafka
HTTP POST
27. Scale of an Online Monitoring Solution
27
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Product N
133 topics
Storm
Kafka
HTTP POST
§ 400 bolt tasks in 40
workers
TSDB_1
TSDB_2
TSDB_3
§ 450 topologies
§ 15 topics /topology
§ 3 partitions /topic
§ 3 TSDB topics
§ 222 partitions per
topic
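Taken together, the figures on this slide imply a very large partition count. A quick back-of-the-envelope check, using only the numbers above:

```python
# Back-of-the-envelope partition math from the slide's figures.
topologies = 450
topics_per_topology = 15
partitions_per_topic = 3
tsdb_topics = 3
partitions_per_tsdb_topic = 222

product_partitions = topologies * topics_per_topology * partitions_per_topic
tsdb_partitions = tsdb_topics * partitions_per_tsdb_topic
total = product_partitions + tsdb_partitions

print(product_partitions)  # 20250
print(tsdb_partitions)     # 666
print(total)               # 20916
```

Tens of thousands of partitions is the scale at which the Kafka metadata and consumer-coordination problems discussed later start to bite.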
29. A Producer - Consumer Pipeline
29
Data
Highway
Data Ingest
Topology
Tenant 1
Tenant 2
Tenant 3
Tenant 1
Tenant 2
Tenant 3
Topics
Tenant 1
Tenant 2
Tenant 3
Aggregation
Topologies UI
Dashboard
&
Graphs
§ Excellent E2E Synchronization
§ Provides a breather against individual component failures
§ Reasonably good performance in spite of transient failures
§ Can help individual components to scale, if used smartly
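The "breather" a queue provides can be pictured with a minimal sketch (illustrative only, not the actual Storm/Kafka code): a bounded buffer keeps the producer making progress while the consumer stalls, and data is only lost once the buffer, the last line of defense, is exhausted.

```python
import queue

# A bounded queue between producer and consumer absorbs short consumer
# stalls: the producer keeps making progress until the buffer fills.
buf = queue.Queue(maxsize=1000)

def produce(events):
    dropped = 0
    for e in events:
        try:
            buf.put_nowait(e)   # non-blocking: fail fast when full
        except queue.Full:
            dropped += 1        # last line of defense exhausted
    return dropped

# While the consumer is stalled, up to maxsize events are buffered.
dropped = produce(range(1500))
print(buf.qsize(), dropped)  # 1000 500
```

Sizing that buffer (and what happens on overflow) is exactly the "choose wisely" decision.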
30. Monitoring Time Roll-ups
30
Topic in-mem state
Kafka Cluster
Spout Bolt
Storm
Topic in-mem state
Topic in-mem state
§ Huge in-memory state
§ 220 million/min * 60
§ Trident issues
§ High network → high CPU
31. Monitoring Time Roll-ups
31
Topic in-mem state
Kafka Cluster
Spout
Storm
Topic in-mem state
Topic in-mem state
§ Aggregate in Spout
§ 220 million/min * 60
§ Fields grouping in Kafka for a time series
Producer
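Aggregating in the spout works because most of the state collapses at the source: one value per metric and time bucket instead of every raw datapoint. A minimal sketch of the roll-up idea (function name and bucket size are illustrative):

```python
from collections import defaultdict

# Rolling up datapoints at the source (here: per metric, per minute)
# before emitting shrinks the downstream in-memory state: one value
# per (metric, minute) instead of every raw datapoint.
def rollup(datapoints):
    # datapoints: iterable of (metric, epoch_seconds, value)
    acc = defaultdict(float)
    for metric, ts, value in datapoints:
        acc[(metric, ts // 60)] += value   # sum within the minute bucket
    return dict(acc)

raw = [("cpu", 0, 1.0), ("cpu", 30, 2.0), ("cpu", 61, 4.0)]
print(rollup(raw))  # {('cpu', 0): 3.0, ('cpu', 1): 4.0}
```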
32. Kafka Refresh
32
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
33. Kafka Refresh
33
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
34. Kafka Refresh
34
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
35. Kafka Refresh
35
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
§ If there are 100 topics and a
broker is down (socket timeout),
the refresh blocks for 1000s, longer
than the next refresh cycle (10 mins)
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
36. Kafka Refresh
36
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
§ If there are 100 topics and a
broker is down (socket timeout),
the refresh blocks for 1000s, longer
than the next refresh cycle (10 mins)
§ Effectively hangs the producer
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
37. Kafka Refresh
37
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
§ If there are 100 topics and a
broker is down (socket timeout),
the refresh blocks for 1000s, longer
than the next refresh cycle (10 mins)
§ Effectively hangs the producer
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
Disable refresh
If a broker is down, the producer APIs
get the metadata from an alternate
broker anyway
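The arithmetic behind the hang, as stated on the slides:

```python
# With a per-topic metadata fetch that can block for the full socket
# timeout when a broker is down, refreshes start to overlap.
topics = 100
fetch_timeout_s = 10
refresh_interval_s = 10 * 60   # refresh every 10 minutes

worst_case_refresh_s = topics * fetch_timeout_s
print(worst_case_refresh_s)                       # 1000
print(worst_case_refresh_s > refresh_interval_s)  # True: one refresh takes
# longer than the refresh interval, refreshes queue up behind each other,
# and the main producer thread is effectively hung.
```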
38. A Producer - Consumer Pipeline
38
Data
Highway
Data Ingest
Topology
Tenant 1
Tenant 2
Tenant 3
Tenant 1
Tenant 2
Tenant 3
Topics
Tenant 1
Tenant 2
Tenant 3
Aggregation
Topologies UI
Dashboard
&
Graphs
§ Excellent E2E Synchronization
§ Provides a breather against individual component failures
§ Reasonably good performance in spite of transient failures
§ Can help individual components to scale, if used smartly
§ Queuing system is your last line of defense, choose wisely
39. Lessons Learned
39
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
40. Skewed Ingestion per Task
40
Spout
bolt
A1
bolt
A2
bolt
A3
bolt
B1
bolt
B2
22 M / min
High rate of ingestion with a “Group By” on limited dimensions will direct all
events for a specific dimension to one task
41. Skewed Ingestion per Task
41
Spout
bolt
A1
bolt
A2
bolt
A3
bolt
B1
bolt
B2
22 M / min
Overall state per task reduces because the combiners share the original big state and
aggregate it before forwarding to the final bolts, thus reducing their overall state
Each combiner maintains local state for
each of the dimensions and forwards
the aggregated count to B1 or B2
com 1
com 2
com 3
Shuffle Partition By
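The combiner pattern above can be sketched in a few lines (a simplified model, not Storm code; the round-robin shuffle and Counter-based partials are illustrative):

```python
from collections import Counter

# Two-stage aggregation: events are first shuffled evenly across
# combiners, each of which pre-aggregates locally; only the (much
# smaller) partial counts are then grouped by key at the final stage.
def combine_stage(events, n_combiners=3):
    combiners = [Counter() for _ in range(n_combiners)]
    for i, key in enumerate(events):        # shuffle: round-robin
        combiners[i % n_combiners][key] += 1
    return combiners

def final_stage(combiners):
    total = Counter()
    for partial in combiners:               # partition by key
        total.update(partial)
    return total

# A skewed stream: one hot key dominates.
events = ["hot"] * 9 + ["cold"]
partials = combine_stage(events)
print(final_stage(partials))  # Counter({'hot': 9, 'cold': 1})
# The final task receives at most 3 partial counts for "hot"
# instead of all 9 raw events.
```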
42. Abuse
42
§ Max ingestion per TSDB - 120k/s
§ UID table hit hard due to high cardinality data
§ Lots of in-memory states created in Storm bolts
43. Lessons Learned
43
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
44. ZooKeeper Scaling
44
Data
Highway
Data Ingest
Topology
Tenant 1
Tenant 2
Tenant 3
Tenant 1
Tenant 2
Tenant 3
Topics
Tenant 1
Tenant 2
Tenant 3
Aggregation
Topologies UI
Dashboard
&
Graphs
ZK - Storm
§ Kafka consumers swapping in and out create heavy churn in the ZK state for Kafka brokers
§ Every time a consumer enters/leaves, all consumers query the group state from ZK
§ Same for Kafka rolling upgrades, restarts, and any bad behaviour by consumers
ZK - Kafka
Single Cluster
for Agg.
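A simplified model of why this churn is so expensive: if every membership change makes all current members re-read the group state from ZK, then N consecutive changes (a rolling restart, for example) cost O(N²) reads. The function below is illustrative, not the actual rebalance protocol:

```python
# Simplified model of the old ZK-based consumer rebalance: every
# join/leave makes *all* current members re-read the group state,
# so N membership changes cost O(N^2) ZK reads.
def zk_reads_for_rolling_restart(n_consumers):
    reads = 0
    for joined in range(1, n_consumers + 1):
        reads += joined          # every current member re-reads state
    return reads

print(zk_reads_for_rolling_restart(10))   # 55
print(zk_reads_for_rolling_restart(100))  # 5050
```

This is why consumer churn, rolling upgrades, and misbehaving consumers all show up as ZK load first.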
48. Re-queue Pipeline – Solution for Write Stability
48
Data Queue
6 Hrs
Requeue queue
24 Hrs
Kafka
Kafka
consumer
TSDB Async HBase lib HBase
UID Lookups
UID table unavailable
No response
NSRE
§ Region splits & hotspots
§ NSREs & GCs
§ Region unresponsive
§ Region unavailability
§ Load rebalancing
§ Region queue size max-out
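The re-queue pattern can be sketched as follows (a toy model: in-process queues stand in for the Kafka topics, and the flaky write stands in for NSREs and region unavailability):

```python
import queue

# Writes that fail against the store are not retried inline on the hot
# path -- they go to a longer-retention "requeue" topic and are retried
# later, keeping the main write pipeline stable.
data_q = queue.Queue()     # stands in for the 6-hr data topic
requeue_q = queue.Queue()  # stands in for the 24-hr requeue topic

def write_batch(storage_write):
    written, requeued = 0, 0
    while not data_q.empty():
        dp = data_q.get()
        try:
            storage_write(dp)
            written += 1
        except IOError:
            requeue_q.put(dp)   # retry later, off the hot path
            requeued += 1
    return written, requeued

for dp in range(5):
    data_q.put(dp)

def flaky_write(dp):
    # Simulate the region being unavailable (NSREs, GCs) for some writes.
    if dp % 2:
        raise IOError("region unavailable")

result = write_batch(flaky_write)
print(result)             # (3, 2): 3 written, 2 parked for later retry
print(requeue_q.qsize())  # 2
```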
49. Lessons Learned
49
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
53. Auto Retries
53
HBase
Guava Cache
(Inflight RPC queue)
Writer Thread Pool
Inserts
Netty Thread Pool
Callback
Failed/success
Given the additional job of handling the
removed / expired entry
Timed-out RPCs
54. Auto Retries
54
HBase
Guava Cache
(Inflight RPC queue)
Writer Thread Pool
Inserts
Netty Thread Pool
Callback
retry
Failed/success
Timed-out RPCs
Given the additional job of handling the
removed / expired entry
Put it back in cache
55. Auto Retries
55
HBase
Guava Cache
(Inflight RPC queue)
Writer Thread Pool
Inserts
Netty Thread Pool
Callback
Given the additional job of
removing expired entry
retry
Failed/success
Stack Overflow!!
Timed-out RPCs
56. Auto Retries
56
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Overflow!!
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Response
✓ ✓
Timed-out RPCs
57. Auto Retries
57
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Unlock
Response
No space in stack!!
Throws exception
✓ ✓
Timed-out RPCs
58. Auto Retries
58
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Unlock
Lock
Response
No space in stack!!
Throws exception
✓ ✓
Timed-out RPCs
59. Auto Retries
59
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Unlock
Lock
Response
No space in stack!!
Throws exception
Lock
✓ ✓
Timed-out RPCs
60. Auto Retries
60
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Response
As the stack has unwound to some extent,
we get space to call Unlock now
Lock
Lock
✓ ✓
Timed-out RPCs
61. Auto Retries
61
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Hangup !!
Thread
dies
Lock
Response
Lock
Lock
§ Thread is dead
§ 3 locks remaining
§ No thread can write/insert as the cache is locked
§ Guava cache hung, TSDB hung!!
✓ ✓
Timed-out RPCs
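The root cause generalizes: retrying inside the removal callback recurses back into the same code path, while a flat retry loop (or handing the entry to a separate retry queue) keeps stack depth constant. A minimal illustration of the difference, not the AsyncHBase/Guava code itself:

```python
import sys

# The failure above in miniature: retrying *inside* the expiry/removal
# callback re-enters the same code path recursively, so a long run of
# failures grows the stack until it overflows (and an exception thrown
# mid-unlock leaves locks held, hanging the cache).
def retry_in_callback(attempts_left):
    if attempts_left == 0:
        return "ok"
    return retry_in_callback(attempts_left - 1)   # callback re-enters

# Draining a queue in a flat loop keeps the stack depth constant no
# matter how many retries pile up.
def retry_via_queue(attempts):
    pending = [attempts]
    while pending:
        left = pending.pop()
        if left > 0:
            pending.append(left - 1)   # re-queue instead of recursing
        else:
            return "ok"

sys.setrecursionlimit(1000)
try:
    retry_in_callback(10_000)
except RecursionError:
    print("stack overflow")            # the recursive retry blows up
print(retry_via_queue(10_000))         # ok -- constant stack depth
```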
62. Lessons Learned
62
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
63. Broker 3
Broker 1
Storm and Kafka – Broker Slowness
63
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Storm
Kafka
HTTP POST
§ Bolt thread writes to the in-memory Kafka queue asynchronously
§ During slowness of even one broker, if this queue fills up it blocks the producer bolt thread, which in turn back-pressures upstream
TSDB_1
TSDB_2
§ 133 topologies
§ 15 topics per
topology
§ 3 partitions per
topic
§ 3 TSDB topics
§ 222 partitions per
topic
§ 22 Kafka brokers
§ If we have no spooling, we lose the data even if the broker recovers; with spooling, replay saves the day
Broker 2
Product2
Product 3
64. Broker 3
Broker 1
Storm and Kafka – Broker Slowness
64
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Storm
Kafka
HTTP POST
§ Bolt thread writes to the in-memory Kafka queue asynchronously
§ During slowness of even one broker, if this queue fills up it blocks the producer bolt thread, which in turn back-pressures upstream
TSDB_1
TSDB_2
§ 133 topologies
§ 15 topics per
topology
§ 3 partitions per
topic
§ 3 TSDB topics
§ 222 partitions per
topic
§ 22 Kafka brokers
§ If we have no spooling, we lose the data even if the broker recovers; with spooling, replay saves the day
Broker 2
Product2
Product 3
✓ Better Monitoring
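A toy model of the spooling point above (in-process objects stand in for the slow broker's bounded send queue and the on-disk spool):

```python
import queue

# When one broker's bounded in-memory send queue fills, spool instead
# of blocking the bolt thread (which back-pressures everything) or
# silently dropping data.
broker_q = queue.Queue(maxsize=2)   # slow broker's async send queue
spool = []                          # stands in for on-disk spooling

def send(msg):
    try:
        broker_q.put_nowait(msg)    # fast path
    except queue.Full:
        spool.append(msg)           # without spooling, msg is lost --
                                    # even if the broker later recovers

for m in range(5):
    send(m)

print(broker_q.qsize(), len(spool))  # 2 3 -- the spool is replayed
                                     # once the broker recovers
```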
65. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Writes from
producer
Reads from consumer
Storm and Kafka – Broker Slowness
66. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
UNUSED
Contents
swapped to disk
67. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
Maximize page
cache
UNUSED
68. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
Contents swapped back
from disk
GC kicks in for swapped
out objects
69. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
Contents swapped back
from disk
GC kicks in for swapped
out objects
Writes
High RPS pipeline will see heavy backpressure
and data will get dropped
vm.swappiness
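Since vm.swappiness decides whether the broker JVM's cold pages get swapped out under page-cache pressure, it is worth checking as part of monitoring. A tiny checker over `sysctl vm.swappiness` output (parsing a sample string rather than probing a live host; the threshold is illustrative):

```python
# Kafka deployments typically pin vm.swappiness near zero so the JVM
# heap is not paged out in favor of the page cache.
def swappiness_ok(sysctl_line, threshold=1):
    # e.g. "vm.swappiness = 60"
    value = int(sysctl_line.split("=")[1])
    return value <= threshold

print(swappiness_ok("vm.swappiness = 60"))  # False: broker heap may be
                                            # swapped out under pressure
print(swappiness_ok("vm.swappiness = 1"))   # True
```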
70. Lessons Learned
70
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application