At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) with real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include:
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
Sinking data from a distributed message queue to files seems easy, right? Well, it is not trivial, but we managed to make it very complicated. At bol.com, the largest webshop of the Netherlands and Belgium, we started measuring customer behavior in real time last year, producing around 500,000 events per second. We ingest that data into Apache Kafka, a popular distributed message queue, and wanted to have that data stored in Parquet files for easier batch processing by different applications in the landscape. To do that we tried several solutions with Apache Flink, a stream processing framework, but transitioning from an event stream to windowed batch files under exactly-once guarantees proved to be very challenging. We struggled a lot with fault tolerance and, in the end, we decided to go back to basics and build a much simpler daily batch job instead of a streaming job. In this talk I’m going to share our struggles and learnings with you, especially about how (not) to deal with fault tolerance when you need exactly-once delivery guarantees.
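One reason a daily batch job can sidestep the fault-tolerance struggle is that batch output can be made idempotent: rerunning a failed day replaces its output rather than duplicating it. The sketch below is our own plain-Python illustration of that pattern (not bol.com's actual job): it stages a day's partition in a temporary file and atomically swaps it into place.

```python
import os
import tempfile

def write_daily_partition(records, out_dir, day):
    """Write one day's output to a temp file, then atomically swap it into
    place. A rerun overwrites the partition instead of appending to it, so
    an at-least-once scheduler still yields effectively-once output.
    (A sketch of the 'simple daily batch' idea, not bol.com's actual job.)"""
    final_path = os.path.join(out_dir, f"day={day}.txt")
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        for rec in records:
            f.write(rec + "\n")
    os.replace(tmp_path, final_path)  # atomic rename on POSIX
    return final_path

out = tempfile.mkdtemp()
write_daily_partition(["e1", "e2"], out, "2019-01-01")
write_daily_partition(["e1", "e2"], out, "2019-01-01")  # rerun: no duplicates
with open(os.path.join(out, "day=2019-01-01.txt")) as f:
    print(f.read().splitlines())  # ['e1', 'e2']
```

A streaming job writing the same files must instead coordinate checkpoints with file commits to get the equivalent guarantee, which is exactly the difficulty the talk describes.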
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an... (DataWorks Summit)
The Central Bank of the Republic of Turkey is primarily responsible for steering the monetary and exchange rate policies in Turkey.
One of the Bank's core functions is market operations. In this context, analyzing and interpreting real-time tick data related to money market instruments has become not only a requirement but also a challenge.
For this use case, an API provided by one of the financial data vendors is used to gather real-time tick data, and data routing is orchestrated by Apache NiFi.
The gathered data is transferred to Kafka topics and then handed off to Druid for real-time indexing tasks.
Indicators such as effective cost, bid-ask spread, price impact measures, and return reversal are calculated using Apache Storm and finally visualized by means of Apache Superset in order to provide decision-makers with a new set of tools.
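For readers unfamiliar with these microstructure indicators, here is a minimal plain-Python sketch of two of them (our own illustration, not the Bank's Storm topology): the quoted bid-ask spread, and the standard effective-spread measure of trading cost, defined as twice the distance between the traded price and the quote midpoint.

```python
def quoted_spread(bid, ask):
    # Width of the quoted market at the time of the tick.
    return ask - bid

def midpoint(bid, ask):
    return (bid + ask) / 2

def effective_spread(trade_price, bid, ask):
    # Standard microstructure definition: twice the distance between the
    # traded price and the quote midpoint prevailing at the trade.
    return 2 * abs(trade_price - midpoint(bid, ask))

# One tick: best bid 99.0, best ask 101.0, trade printed at 100.5.
print(quoted_spread(99.0, 101.0))            # 2.0
print(effective_spread(100.5, 99.0, 101.0))  # 1.0
```

In the pipeline described above, Storm would evaluate formulas like these continuously over the tick stream arriving from Kafka.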
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn (Confluent)
(Celia Kung, LinkedIn) Kafka Summit SF 2018
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address such issues, we have developed a new mirroring solution, built on top of our stream ingestion service, Brooklin. Brooklin MirrorMaker aims to provide improved performance and stability, while facilitating better management through finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin MirrorMaker and our plans for iterating further on this new mirroring solution.
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
Unbounded, unordered, global scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow.
Apache Beam handles both batch and streaming use cases, offering a powerful, unified model. It neatly separates properties of the data from run-time characteristics, allowing pipelines to be portable across multiple run-time environments, both open source (including Apache Apex, Apache Flink, Apache Gearpump, and Apache Spark) and proprietary. Finally, Beam's model enables newer optimizations, like dynamic work rebalancing and autoscaling, resulting in efficient execution.
This talk will cover the basics of Apache Beam, touch on its evolution, and describe main concepts in its powerful programming model. We'll show how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios. Finally, we'll demonstrate pipeline portability across Apache Apex, Apache Flink, Apache Spark and Google Cloud Dataflow in a live setting.
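One of the model's core ideas, assigning an unbounded, unordered event stream to event-time windows, can be sketched in a few lines. The plain Python below is a conceptual illustration only, not the actual Beam API; it mimics what Beam's fixed windowing does when grouping by event time rather than arrival order:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size):
    """Group (event_time, value) pairs into fixed event-time windows,
    in the spirit of Beam's fixed windows (a plain-Python sketch,
    not the actual Beam API)."""
    windows = defaultdict(list)
    for event_time, value in events:
        # Each event lands in the window covering its own timestamp,
        # regardless of the order in which events arrive.
        window_start = (event_time // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

events = [(1, "a"), (61, "c"), (4, "b")]  # timestamps in seconds, out of order
print(assign_fixed_windows(events, 60))
```

Real Beam pipelines add watermarks and triggers on top of this grouping to decide when a window's results may be emitted, which is where the latency/completeness trade-off in the abstract comes from.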
Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, I will speak about the creation and evolution of the pipeline, and a concrete example – a day in the life of an event tracking pixel. We'll also talk about common challenges that we've overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016 (Gyula Fóra)
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing the incoming data in real-time. Performing large-scale stream analysis requires specialized tools and techniques which have become widely available in the last couple of years. This talk will give a deep, technical overview of the Apache stream processing landscape. We compare several frameworks, including Flink, Spark, Storm, Samza and Apex. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner, to help select the best tools for specific applications. We will touch on the topics of API expressivity, runtime architecture, performance, fault-tolerance and strong use-cases for the individual frameworks. This talk is targeted at anyone interested in streaming analytics, from either a user's or a contributor's perspective. Attendees can expect to get a clear view of the available open-source stream processing architectures.
Why Apache Flink is the 4G of Big Data Analytics Frameworks (Slim Baltagi)
Apache Flink is a community-driven, open-source, memory-centric Big Data analytics framework. It provides the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs, and some of its libraries are essentially pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine. It also provides APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API), as well as domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more detail about:
* What Apache Flink is, how it fits into the Big Data ecosystem, and why it is the 4G (4th generation) of Big Data analytics frameworks
* How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output, as well as deployment
* Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark, and what the benchmarking results are between Apache Flink and those other Big Data analytics frameworks
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te... (Databricks)
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all this data, CERN is investigating approaches complementary to the traditional ones, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming, combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
Storage Requirements and Options for Running Spark on Kubernetes (DataWorks Summit)
In a world of serverless computing, users tend to be frugal about expenditure on compute, storage and other resources; paying for resources that aren't in use becomes a significant factor. Offering Spark as a service in the cloud presents unique challenges, and running Spark on Kubernetes raises many of them, especially around storage and persistence. Spark workloads have distinctive storage requirements: space for intermediate data, long-term persistence, and a shared file system. These requirements become even tighter when the same platform must be offered as a service to enterprises that need to manage GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
It will help people using Kubernetes or the Docker runtime in production understand the various storage options available, which ones are most suitable for running Spark workloads on Kubernetes, and what more can be done.
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira (Databricks)
Data integration is a really difficult problem. We know this because 80% of the time in every project is spent getting the data you want the way you want it. We know this because this problem remains challenging despite 40 years of attempts to solve it. All we want is a service that will be reliable, handle all kinds of data and integrate with all kinds of systems, be easy to manage and scale as our systems grow. Oh, and it should be super low latency too. Is it too much to ask?
In this presentation, we’ll discuss the basic challenges of data integration and introduce a few design and architecture patterns that are used to tackle these challenges. We will then explore how these patterns can be implemented using Apache Kafka. Difficult problems are difficult and we offer no silver bullets, but we will share pragmatic solutions that have helped many organizations build fast, scalable and manageable data pipelines.
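The central pattern behind Kafka-based integration is a shared append-only log from which each downstream system consumes at its own pace, tracking its own offset. The toy in-memory sketch below (our own illustration, not the Kafka client API) shows why this decouples producers from any number of independent consumers:

```python
class Log:
    """Toy append-only log with per-consumer offsets, illustrating the
    log-based integration pattern that Kafka implements at scale."""
    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def produce(self, record):
        # Producers only ever append; they know nothing about consumers.
        self.records.append(record)

    def consume(self, consumer, max_records=10):
        # Each consumer advances its own offset independently.
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

log = Log()
log.produce({"user": 1, "event": "signup"})
log.produce({"user": 2, "event": "purchase"})

# Two independent consumers read the same data at their own pace.
print(log.consume("search-indexer"))  # both records
print(log.consume("warehouse", 1))    # just the first record
```

Because the log, not the consumers, is the source of truth, adding a new destination system is just a new consumer starting from offset zero, which is what makes the pattern attractive for the integration problem described above.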
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information about a vehicle, such as its location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context for understanding driving behaviors at a detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently, and the raw sensor and GPS data must be properly pre-processed and transformed to extract the signals relevant to downstream processes. In addition, driving behavior often depends on the spatial context, so the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scale telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc.). We discuss common techniques to load and transform telematics data. We then present how to use machine learning on telematics data to derive insights about driving safety.
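As an example of the kind of transformation mentioned above, one basic step is deriving speed from pairs of raw GPS fixes. The plain-Python sketch below (our own illustration, not Uber's pipeline) uses the haversine great-circle distance between consecutive fixes:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two GPS fixes, in kilometers.
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def speed_kmh(fix1, fix2):
    """Derive average speed from two (timestamp_s, lat, lon) fixes."""
    t1, lat1, lon1 = fix1
    t2, lat2, lon2 = fix2
    hours = (t2 - t1) / 3600.0
    return haversine_km(lat1, lon1, lat2, lon2) / hours
```

In practice this derived speed is then cleaned further (GPS jitter can produce implausible jumps), which is part of the pre-processing challenge the abstract refers to.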
Speakers
Yanwei Zhang, Senior Data Scientist II, Uber
Neil Parker, Senior Software Engineer, Uber
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with a focus on real-time stream processing. Your feedback and comments are much appreciated.
Xiaomi is a Chinese technology company that sold more than 100 million smartphones worldwide in 2018 and owns one of the world's largest IoT device platforms. Xiaomi builds dozens of mobile apps and Internet services based on intelligent devices, including ads, news feeds, financial services, games, music, video, personal cloud services and so on. The rapid growth of the business has driven exponential growth of the data analytics infrastructure: the amount of data has grown more than 20-fold in the past 3 years, which poses big challenges for HDFS scalability.
In this talk, we introduce how we scale HDFS to support hundreds of petabytes of data across thousands of nodes:
1. How Xiaomi uses Hadoop, and the characteristics of our usage
2. How we made an HDFS federation cluster usable like a single cluster, so that most applications don't need to change any code to migrate from a single cluster to a federation cluster. Our work includes a wrapper FileSystem compatible with DistributedFileSystem, support for rename across different namespaces, and a ZooKeeper-based mount table renewer.
3. Our experience tuning the NameNode to improve scalability
4. How we maintain hundreds of HDFS clusters, and the client-side optimizations we made so that users and programs can access these clusters easily and with high performance
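The client-side mount table described in point 2 is essentially a longest-prefix-match router from paths to namespaces. The sketch below is an illustrative plain-Python toy only (the class name, namenode addresses and API are hypothetical; Xiaomi's real wrapper is a Java FileSystem):

```python
class FederatedFileSystem:
    """Toy client-side router that maps paths to HDFS namespaces via a
    mount table, in the spirit of the wrapper FileSystem described above."""
    def __init__(self, mount_table):
        # mount_table: path prefix -> namenode address.
        # Sort by descending prefix length so the most specific mount wins.
        self.mount_table = sorted(mount_table.items(), key=lambda kv: -len(kv[0]))

    def resolve(self, path):
        # Longest-prefix match picks the namespace that owns the path.
        for prefix, namenode in self.mount_table:
            if path.startswith(prefix):
                return namenode
        raise KeyError(f"no mount point for {path}")

fs = FederatedFileSystem({
    "/user": "nn1.example.com:8020",
    "/logs": "nn2.example.com:8020",
    "/logs/archive": "nn3.example.com:8020",
})
print(fs.resolve("/logs/archive/2019"))  # nn3.example.com:8020
print(fs.resolve("/user/alice"))         # nn1.example.com:8020
```

With a router like this in the client, applications keep using ordinary paths while the mount table (refreshed from ZooKeeper in Xiaomi's design) decides which namenode actually serves each request.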
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Streaming Analytics with Spark, Kafka, Cassandra and Akka (Helena Edelson)
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery and require many technologies with overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, not to mention the cost (e.g. of ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture, application design and code to begin leveraging simpler data flows. We will cover how this particular set of technologies addresses common requirements and how they work together to enrich and reinforce each other.
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka (Databricks)
Stream, stream, stream: Different streaming methods with Spark and Kafka (Itai Yaffe)
Going into different streaming methods, we will share our experience as early adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will also present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs.
Topics include:
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* “Streaming” over Data Lake using Kafka
Unbounded, unordered, global scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, Millwheel, and Cloud Dataflow.
Apache Beam handles both batch and streaming use cases, offering a powerful, unified model. It neatly separates properties of the data from run-time characteristics, allowing pipelines to be portable across multiple run-time environments, both open source, including Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and proprietary. Finally, Beam's model enables newer optimizations, like dynamic work rebalancing and autoscaling, resulting in an efficient execution.
This talk will cover the basics of Apache Beam, touch on its evolution, and describe main concepts in its powerful programming model. We'll show how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios. Finally, we'll demonstrate pipeline portability across Apache Apex, Apache Flink, Apache Spark and Google Cloud Dataflow in a live setting.
Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, I will speak about the creation and evolution of the pipeline, and a concrete example – a day in the life of an event tracking pixel. We'll also talk about common challenges that we've overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing the incoming data in real-time. Performing large scale stream analysis requires specialized tools and techniques which have become widely available in the last couple of years. This talk will give a deep, technical overview of the Apache stream processing landscape. We compare several frameworks including Flink , Spark, Storm, Samza and Apex. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner to help selecting the best tools for the specific applications. We will touch on the topics of API expressivity, runtime architecture, performance, fault-tolerance and strong use-cases for the individual frameworks. This talk is targeted towards anyone interested in streaming analytics either from user’s or contributor’s perspective. The attendees can expect to get a clear view of the available open-source stream processing architectures
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more details about:
What is Apache Flink, how it fits into the Big Data ecosystem and why it is the 4G (4th Generation) of Big Data Analytics frameworks?
How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment?
Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? What are the benchmarking results between Apache Flink and those other Big Data analytics frameworks?
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all those data, CERN is investigating complementary ways to the traditional approaches, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
In a world of serverless computing users tend to be frugal when it comes to expenditure on compute, storage and other resources. Paying for the same when they aren’t in use becomes a significant factor. Offering Spark as service on cloud presents very unique challenges. Running Spark on Kubernetes presents a lot of challenges especially around storage and persistence. Spark workloads have very unique requirements of Storage for intermediate data, long time persistence, Share file system and requirements become very tight when it same need to be offered as a service for enterprise to mange GDPR and other compliance like ISO 27001 and HIPAA certifications.
This talk covers challenges involved in providing Serverless Spark Clusters share the specific issues one can encounter when running large Kubernetes clusters in production especially covering the scenarios related to persistence.
This talk will help people using Kubernetes or docker runtime in production and help them understand various storage options available and which is more suitable for running Spark workloads on Kubernetes and what more can be done
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
Data integration is a really difficult problem. We know this because 80% of the time in every project is spent getting the data you want the way you want it. We know this because this problem remains challenging despite 40 years of attempts to solve it. All we want is a service that will be reliable, handle all kinds of data and integrate with all kinds of systems, be easy to manage and scale as our systems grow. Oh, and it should be super low latency too. Is it too much to ask?
In this presentation, we’ll discuss the basic challenges of data integration and introduce few design and architecture patterns that are used to tackle these challenges. We will then explore how these patterns can be implemented using Apache Kafka. Difficult problems are difficult and we offer no silver bullets, but we will share pragmatic solutions that helped many organizations build fast, scalable and manageable data pipelines.
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information of a vehicle such as the location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context to understand driving behaviors at the detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently. And the raw sensor and GPS data must be properly pre-processed and transformed to extract signal relevant to downstream processes. In addition, driving behavior often depends on the spatial context, and the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scaled telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc). We discuss common techniques to load and transform telematics data. We then present how to use machine learning on telematics data to derive insights about driving safety.
Speakers
Yanwei Zhang, Senior Data Scientist II, Uber
Neil Parker, Senior Software Engineer, Uber
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Xiaomi is a Chinese technology company that sold more than 100 million smartphones worldwide in 2018 and owns one of the world's largest IoT device platforms. Xiaomi builds dozens of mobile apps and Internet services based on intelligent devices, including ads, news feeds, financial services, games, music, video, personal cloud services and so on. The rapid growth of the business results in exponential growth of the data analytics infrastructure. The amount of data has grown more than 20 times in the past 3 years, which presents big challenges to HDFS scalability.
In this talk, we introduce how we scale HDFS to support hundreds of PB of data with thousands of nodes:
1. How Xiaomi uses Hadoop and the characteristics of our usage
2. We made an HDFS federation cluster usable like a single cluster; most applications don't need to change any code to migrate from a single cluster to a federation cluster. Our work includes a wrapper FileSystem compatible with DistributedFileSystem, support for rename across different namespaces, and a ZooKeeper-based mount table renewer.
3. Experience of tuning NameNode to improve scalability
4. How we maintain hundreds of HDFS clusters, and the client-side optimizations we made so users and programs can access these clusters easily and with high performance
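The client-side wrapper in point 2 can be sketched roughly as follows. This is an illustrative Python sketch of a mount table doing longest-prefix resolution of logical paths to namespaces (the real component is a Java FileSystem; the namespace names and mount points here are invented for the example):

```python
# Sketch (not Xiaomi's actual code): a client-side mount table that maps
# path prefixes to HDFS namespaces, so applications see one logical cluster.
# Namespace URIs "hdfs://ns1"/"hdfs://ns2" and the mount points are assumptions.

class MountTable:
    def __init__(self, mounts):
        # mounts: {"/user": "hdfs://ns1", "/logs": "hdfs://ns2"}
        # Sort by descending prefix length for longest-prefix match,
        # similar in spirit to a ViewFs-style mount table.
        self.mounts = sorted(mounts.items(), key=lambda kv: -len(kv[0]))

    def resolve(self, logical_path):
        for prefix, namespace in self.mounts:
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                return namespace + logical_path
        raise KeyError("no mount point for " + logical_path)

mt = MountTable({"/user": "hdfs://ns1", "/logs": "hdfs://ns2"})
resolved = mt.resolve("/logs/2019/01/01/events")  # -> hdfs://ns2/logs/2019/01/01/events
```

A ZooKeeper-based renewer, as mentioned above, would then refresh `mounts` whenever the shared mount configuration changes.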
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Streaming Analytics with Spark, Kafka, Cassandra and Akka (Helena Edelson)
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery and require many technologies with overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (i.e. ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how this particular set of technologies addresses common requirements and how, collaboratively, they work together to enrich and reinforce each other.
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka (Databricks)
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth. We will share our experience as early adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs. Topics include:
Kafka and Spark Streaming for stateless and stateful use-cases
Spark Structured Streaming as a possible alternative
Combining Spark Streaming with batch ETLs
”Streaming” over Data Lake using Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka (Itai Yaffe)
Going into different streaming methods, we will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will also present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs.
Topics include:
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* “Streaming” over Data Lake using Kafka
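The last bullet, "streaming" over a Data Lake with Kafka, can be sketched as follows: rather than streaming the events themselves, publish the *paths* of newly written Data Lake files to a topic, and let a consumer process each file as a micro-batch. This is a hedged illustration of the idea only; a `deque` stands in for the Kafka topic, and the S3 paths are invented:

```python
# Illustrative sketch of "streaming" over a Data Lake via Kafka: the topic
# carries file paths, not events. A deque simulates the Kafka topic here;
# a real implementation would use a Kafka producer/consumer.
from collections import deque

topic = deque()  # stands in for a Kafka topic of new-file notifications

def publish_new_file(path):
    topic.append(path)

def consume_batch(process):
    """Drain currently available paths and process each file once."""
    processed = []
    while topic:
        path = topic.popleft()
        process(path)           # e.g. read the Parquet file and aggregate
        processed.append(path)  # in real life, commit the offset only after success
    return processed

publish_new_file("s3://lake/events/dt=2019-06-01/part-000.parquet")
publish_new_file("s3://lake/events/dt=2019-06-01/part-001.parquet")
seen = consume_batch(lambda p: None)
```

The cost saving comes from consumers reading cheap file listings from the topic instead of re-scanning the Data Lake for new data.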
A brief introduction to Apache Kafka and its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
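The kind of stateful processing Kafka Streams provides can be illustrated with a keyed running count kept in a local state store. This is plain Python showing the model, not the Kafka Streams API; the record keys are invented:

```python
# Hedged sketch of a Kafka Streams-style stateful aggregation: a running
# count per key, kept in a local state store (a dict stands in for it).
from collections import defaultdict

state = defaultdict(int)  # stands in for a Kafka Streams state store

def process(records):
    for key, _value in records:
        state[key] += 1   # roughly groupByKey().count() in Kafka Streams terms
    return dict(state)

counts = process([("page_view", 1), ("click", 1), ("page_view", 1)])
```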
Spark Summit EU 2015: Lessons from 300+ production users (Databricks)
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
Netflix Open Source Meetup Season 4 Episode 2 (aspyker)
In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix.
The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs.
The second, Dynomite, is a framework that makes any non-distributed data store distributed. Netflix's first implementation of Dynomite is based on Redis.
Come learn about the products' features and hear from Thomson Reuters, Diego Pacheco from Ilegra and other third-party speakers, internal and external to Netflix, on how these products fit into their stacks and roadmaps.
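The L1/L2 direction mentioned for EVCache can be sketched as a small, fast RAM tier backed by a larger SSD tier, with reads promoting entries upward. This is an illustrative sketch of the general two-tier idea, not Netflix code; the eviction policy is deliberately simplified:

```python
# Illustrative two-tier cache: l1 stands in for the RAM tier, l2 for the
# SSD tier. Reads promote L2 hits into L1 when there is room. A production
# cache would use a real eviction policy (LRU etc.) instead of "no room, skip".

class TwoTierCache:
    def __init__(self, l1_capacity):
        self.l1, self.l2 = {}, {}
        self.l1_capacity = l1_capacity

    def put(self, key, value):
        self.l2[key] = value          # SSD tier holds everything
        if len(self.l1) < self.l1_capacity:
            self.l1[key] = value      # RAM tier holds the hot subset

    def get(self, key):
        if key in self.l1:
            return self.l1[key]       # L1 hit: cheapest path
        value = self.l2.get(key)
        if value is not None and len(self.l1) < self.l1_capacity:
            self.l1[key] = value      # promote on an L2 hit
        return value

cache = TwoTierCache(l1_capacity=1)
cache.put("a", 1)
cache.put("b", 2)                     # "b" lands only in L2: L1 is full
```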
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of big data technologies, which were selected or disregarded in our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
The Netflix Way to deal with Big Data Problems (Monal Daxini)
Netflix is a data-driven company with a unique culture. Come take a holistic tour of the Big Data ecosystem and see how Netflix culture catalyzes the development of systems. Then marvel at how we quickly evolved and scaled the event pipeline to 1 trillion events per day and over 1.4 PB of event data, without service disruption and with a small team.
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... (Landon Robinson)
Presented by Landon Robinson and Jack Chapa
Did you miss Scala Days 2015 in San Francisco? Have no fear! BoldRadius was there and we've compiled the best of the best! Here are the highlights of a great conference.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 (Monal Daxini)
Keystone - Processing over half a trillion events per day with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering self-service stream processing infrastructure atop the Kafka-based pipeline, with support for Spark Streaming.
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be... (Codemotion)
A vast volume of our processed data is time-series data, and once you start working with distributed systems, you start tackling many scale and performance problems: How do we handle missing data? Should one system handle both serving and backend processes, or should they be separated? What gives the best performance for the money? In the talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, some of the problems we’ve handled, and review the multiple data persistency layers: S3, MongoDB, Apache Cassandra, MySQL. And I’ll try my best NOT to answer the question “Which one of them is the best?”
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great to master by one person. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
* How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
* Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
* How to handle streaming?
* How to manage costs?
* Performance tips?
* Security tips?
* Cloud best practices?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Scala-like distributed collections - dumping time-series data with Apache Spark (Demi Ben-Ari)
Spark RDDs are almost identical to Scala collections, just in a distributed manner; all of the transformations and actions are derived from the Scala collections API.
As Martin Odersky mentioned, “Spark - The Ultimate Scala Collections” is the right way to look at RDDs. But with that great distributed power come a great many data problems: at first you’ll tackle the concept of partitioning, and then the actual data becomes the next thing to worry about.
In the talk we’ll go through an overview on Spark's architecture, and see how similar RDDs are to the Scala collections API. We'll then shift to the world of problems that you’ll be facing when using Spark for processing a vast volume of time-series data with multiple data stores (S3, MongoDB, Apache Cassandra, MySQL).
When you start tackling many scale and performance problems, many questions arise:
> How to handle missing data?
> Should the system handle both serving and backend processes, or should we separate them out?
> Which solution is cheaper?
> How do we get the best performance for money spent?
In the talk we will tell the tale of all of the transformations we’ve made to our data and review the multiple data persistency layers... and I’ll try my best NOT to answer the question “which persistency layer is the best?” but I do promise to share our pains and lessons learned!
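One concrete transformation of the kind discussed above is bucketing raw time-series events into daily partitions, so each persistence layer (S3 key prefixes, Cassandra partitions, and so on) can be addressed by day. This is a hedged sketch; the S3 key layout is an assumption, not Windward's actual scheme:

```python
# Sketch: bucket (timestamp, payload) events into daily S3-style partitions.
# The "s3://bucket/timeseries/YYYY/MM/DD" layout is illustrative only.
from collections import defaultdict
from datetime import datetime, timezone

def daily_buckets(events):
    """events: iterable of (epoch_seconds, payload) -> {partition_key: [payloads]}"""
    buckets = defaultdict(list)
    for ts, payload in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d")
        buckets["s3://bucket/timeseries/" + day].append(payload)
    return dict(buckets)

# 1546300800 = 2019-01-01 00:00:00 UTC; 1546387200 = 2019-01-02 00:00:00 UTC
parts = daily_buckets([(1546300800, "a"), (1546387200, "b"), (1546300900, "c")])
```

Partitioning by day keeps writes append-only per bucket and lets serving queries prune whole days, which is one way to separate the serving and backend concerns raised above.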
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r... (Omid Vahdaty)
AWS Big Data Demystified is all about knowledge sharing, because knowledge should be given for free. In this lecture we will discuss the advantages of working with Zeppelin + Spark SQL, JDBC + Thrift, R + SparkR + Livy, and a little bit about Ganglia on EMR.
Subscribe to our YouTube channel to see the video of this lecture:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans to offer Stream Processing as a Service for all of Netflix.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
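The durability idea behind a RAFT-backed log service can be illustrated by the quorum rule: an append is acknowledged only once a majority of replicas have persisted it. The sketch below shows the quorum arithmetic only, not the Ratis API; replica logs are plain lists and a `None` stands in for an unavailable replica:

```python
# Hedged sketch of majority-quorum durability, the core RAFT property a
# Log Service relies on. Not the Apache Ratis API.

def quorum_append(replicas, entry):
    """replicas: list of per-replica logs (or None if a replica is down).
    Returns True when a majority accepted the entry."""
    acks = 0
    for log in replicas:
        if log is None:        # replica unavailable; no durable ack from it
            continue
        log.append(entry)      # in Ratis this would be an async RPC + fsync
        acks += 1
    return acks >= len(replicas) // 2 + 1

logs = [[], [], None]                           # one of three replicas down
committed = quorum_append(logs, ("wal-edit", 1)) # 2 of 3 acks -> committed
```

With three replicas, the log tolerates one failure while still committing, which is the durability level an HBase WAL needs.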
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. Since it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals and organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in a drastic increase in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
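The heart of a dirlist-style design is a row-key encoding in which every path is prefixed with its depth, so all children of a directory sort together and can be fetched with a single range scan. The sketch below illustrates that encoding with a sorted Python list standing in for the Accumulo table; the 3-digit depth prefix is an illustrative choice, not necessarily the example's exact format:

```python
# Hedged sketch of the dirlist row-key idea: depth-prefixed paths sort so
# that one contiguous range scan returns exactly a directory's children.

def row_key(path):
    depth = path.rstrip("/").count("/")
    return "%03d%s" % (depth, path)   # e.g. "/user/alice" -> "002/user/alice"

# A sorted list stands in for the lexicographically-sorted Accumulo table.
table = sorted(row_key(p) for p in [
    "/user", "/user/alice", "/user/alice/data.txt", "/user/bob",
])

def children(table, directory):
    """One 'range scan': all rows at depth+1 under the directory prefix."""
    prefix = "%03d%s/" % (directory.rstrip("/").count("/") + 1, directory)
    return [k[3:] for k in table if k.startswith(prefix)]
```

Because children of different directories at the same depth still have distinct path prefixes, the scan never needs a post-filter over unrelated rows.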
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, leading to duplicated data and breaking data correctness and user queries. This component is key to scaling our jobs, where we now handle greater than 500 billion writes a day in our current ingestion systems. It needs strong consistency and must provide high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component; it is critical in allowing us to scale our jobs to greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to load HFiles directly to the backend (circumventing the normal write path) when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
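The job a global index performs during ingestion can be sketched simply: look up each incoming record's key to decide whether it is an update to an existing file location or a brand-new insert that must be assigned one. In Uber's system the index lives in HBase; in this hedged sketch a dict stands in for it, and the HDFS paths and file-assignment rule are invented for illustration:

```python
# Hedged sketch of global-index annotation during ingestion.
# `index` simulates the HBase table: record key -> HDFS file location.
index = {}

def annotate(records):
    """Tag each (key, payload) as insert/update with its target HDFS location."""
    out = []
    for key, payload in records:
        if key in index:
            # Known key: route the change to the file that already holds it.
            out.append((key, payload, "update", index[key]))
        else:
            # New key: assign a location (round-robin here, purely illustrative).
            location = "hdfs://lake/trips/file-%d" % (len(index) % 4)
            index[key] = location
            out.append((key, payload, "insert", location))
    return out

batch = annotate([("trip-1", "v1"), ("trip-2", "v1"), ("trip-1", "v2")])
```

Without this lookup, `trip-1`'s second record would be written as a fresh insert, producing exactly the duplication described above.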
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, plus various tools and libraries to help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
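The tokenize/de-tokenize idea described above can be sketched with a keyed mapping: the same input always yields the same token (so joins and analytics still work on anonymized data), and authorized services can reverse it. This is not Ranger's implementation; the key handling and 16-character token length are assumptions for the example:

```python
# Hedged sketch of deterministic tokenization with a reversible vault.
# In a real deployment the key and the token vault would be managed by the
# security service, never held in process memory like this.
import hashlib
import hmac

SECRET = b"demo-key"   # assumption: provisioned by the security service
_vault = {}            # token -> original value, for authorized de-anonymization

def tokenize(value):
    token = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]
    _vault[token] = value
    return token

def detokenize(token):
    # In practice, gated by a policy check (e.g. a Ranger authorization call).
    return _vault[token]

t = tokenize("patient-123")
assert tokenize("patient-123") == t   # deterministic: same value, same token
```

Keyed (HMAC) hashing rather than a plain hash prevents an attacker who knows the scheme from precomputing tokens for guessed values.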
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail and its implications for the broader Consumer Goods industry, and share the business drivers, use cases and benefits that are unfolding as it becomes an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways a retail store of the near future could operate: a deep learning system attached to a camera stream identifying various storefront situations, such as item stocks on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 (Tobias Schneck)
As AI pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles to it as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence gathering facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps means. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud, and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
3. Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Attended our session yesterday about counting unique users with Druid?
● Working with Spark/Kafka? Planning to?
4. Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake
5. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
6. Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?
8. Data flow in the old days...
In-DB aggregation
OLAP
9. Data flow in the old days… What’s wrong with that?
● CSV-related issues, e.g:
○ Truncated lines in input files
○ Can’t enforce schema
● Scale-related issues, e.g:
○ Had to “manually” scale the processes
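The CSV pain points above can be made concrete with a minimal, hypothetical validator of the kind each consumer had to hand-roll before Kafka and Avro took over schema enforcement. The field names and the three-column schema are invented for illustration; this is a sketch, not the production code:

```python
import csv, io

# Hypothetical 3-column schema: date (str), user_id (int), event (str)
SCHEMA = [("date", str), ("user_id", int), ("event", str)]

def validate(fields):
    """Return a parsed row, or None if the line is truncated or malformed."""
    if len(fields) != len(SCHEMA):
        return None                    # truncated line / wrong column count
    row = {}
    for (name, typ), value in zip(SCHEMA, fields):
        try:
            row[name] = typ(value)
        except ValueError:
            return None                # schema violation (e.g. non-numeric id)
    return row

raw = "2014-07-17,1234,click\n2014-07-17,56"   # second line is truncated
rows = [validate(r) for r in csv.reader(io.StringIO(raw))]
good = [r for r in rows if r is not None]
assert len(good) == 1 and good[0]["user_id"] == 1234
```

Every consumer repeating checks like these (and still silently dropping data) is exactly the kind of problem a schema registry eliminates at the producer side.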
10. That's one small step for [a] man… (2014)
“Apache Spark is the Taylor Swift of big data software" (Derrick Harris, Fortune.com, 2015)
In-DB aggregation
OLAP
11. Why just a small step?
● Solved the scaling issues
● Still faced the CSV-related issues
12. Data flow - the modern way
14. The need for stateful streaming
Fast forward a few months...
● New requirements were being raised
● Specific use-case:
○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app
15. Stateful streaming via “local” aggregations
1. Read messages from Kafka
2. Aggregate current micro-batch
3. Write combined aggregated data to HDFS
4. Read aggregated data from HDFS every X micro-batches
5. Upsert aggregated data into the OLAP store (every X micro-batches)
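The local-aggregation loop described on this slide can be sketched with in-memory stand-ins for HDFS and the DB (a toy word-count, not the production code; `UPSERT_EVERY` plays the role of "X micro-batches"):

```python
from collections import Counter

db = {}                  # stands in for the operational DB (upsert target)
combined = Counter()     # stands in for the combined aggregates kept on HDFS
UPSERT_EVERY = 3         # "X micro-batches"

def upsert(agg):
    # UPSERT semantics: add counts into existing rows, insert new keys
    for key, count in agg.items():
        db[key] = db.get(key, 0) + count

def on_micro_batch(batch_no, messages):
    global combined
    combined += Counter(messages)        # aggregate batch, combine with prior state
    if batch_no % UPSERT_EVERY == 0:     # every X micro-batches:
        upsert(combined)                 #   push aggregates to the DB
        combined = Counter()             #   delete the aggregated "files"

for i, batch in enumerate([["a", "b", "a"], ["b"], ["a"]], start=1):
    on_micro_batch(i, batch)

assert db == {"a": 3, "b": 2}
```

The next slide's criticisms map directly onto this sketch: `combined` is state we manage ourselves, and losing it (a terminated cluster) loses data that was never upserted.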
16. Stateful streaming via “local” aggregations
● Required us to manage the state on our own
● Error-prone
○ E.g what if my cluster is terminated and data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
17. Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
●Enables running continuous, incremental processes
○ Basically manages the state for you
●Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
●Many other features
●Was in ALPHA mode in 2.0 and 2.1
18. Structured Streaming - stateful app use-case
1. Read messages from Kafka
2. Aggregate current window
3. Checkpoint (offsets and state) handled internally by Spark
4. Upsert aggregated data into the OLAP store (on window end)
19. Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1) :
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Checkpointing to S3 wasn’t straight-forward
○ Tried using EMRFS consistent view
■ Worked for stateless apps
■ Encountered sporadic issues for stateful apps
20. Structured Streaming - strengths and weaknesses (IMO)
● Strengths include :
○ Running incremental, continuous processing
○ Increased performance (e.g via Catalyst SQL optimizer)
○ Massive efforts are invested in it
● Weaknesses were mostly related to maturity
21. Back to the future - Spark Streaming revived for “stateful” app use-case
1. Read messages from Kafka
2. Aggregate current micro-batch
3. Write files
4. Load data into the OLAP store
23. Cool, so… Why can’t we stop here? (cont.)
● Extreme load of Kafka brokers’ disks
○ Each micro-batch needs to read ~300M messages, Kafka can’t store it all in memory
● ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
○ Forced us to use 1 core per executor to avoid it
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving
https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it even less frequently
○ Remember - longer micro-batches result in a better aggregation ratio
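The remark about micro-batch length can be made concrete: with a bounded key space, the aggregated output is capped at the number of distinct keys, so lengthening the batch directly improves the aggregation ratio (input rows per output row). A toy illustration, assuming 1,000 distinct keys and a batch large enough that every key appears:

```python
DISTINCT_KEYS = 1_000  # assumed size of the key space, for illustration

def aggregation_ratio(batch_size):
    # The aggregated output has at most one row per distinct key,
    # so larger batches compress more per output row.
    output_rows = min(batch_size, DISTINCT_KEYS)
    return batch_size / output_rows

assert aggregation_ratio(10_000) == 10.0
assert aggregation_ratio(20_000) == 20.0   # longer micro-batch -> better ratio
```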
24. Enter “streaming” over RDR
RDR (or Raw Data Repository) is our Data Lake
●Kafka topic messages are stored on S3 in Parquet format
●RDR Loaders - stateless Spark Streaming applications
●Applications can read data from RDR for various use-cases
○ E.g analyzing data of the last 30 days
Can we leverage our Data Lake and use it as the data source (instead of Kafka)?
25. How do we “stream” RDR files - producer side
1. RDR Loaders read messages from Kafka
2. Write files to S3 (RDR)
3. Write the files’ paths to a designated Kafka topic (the files’ paths are the messages)
26. How do we “stream” RDR files - consumer side
1. Read the files’ paths from the designated Kafka topic
2. Read the RDR files from S3
3. Process the files
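The producer/consumer pattern on these two slides can be simulated end-to-end with local files standing in for S3 and a plain list standing in for the paths topic (all names are hypothetical; the real loaders write Parquet to S3 and commit Kafka offsets):

```python
import json, pathlib, tempfile

paths_topic = []                        # stands in for the Kafka topic of file paths
rdr = pathlib.Path(tempfile.mkdtemp()) # stands in for the S3 RDR bucket

def rdr_loader(messages, batch_no):
    """Producer side: write a batch of messages as a file, then publish its path."""
    path = rdr / f"batch-{batch_no}.json"
    path.write_text(json.dumps(messages))
    paths_topic.append(str(path))       # step 3: the path is the message

def batch_consumer(n_paths):
    """Consumer side: read the next n paths, then read and process those files."""
    rows = []
    for p in paths_topic[:n_paths]:
        rows.extend(json.loads(pathlib.Path(p).read_text()))
    return rows

rdr_loader(["e1", "e2"], 1)
rdr_loader(["e3"], 2)
assert batch_consumer(2) == ["e1", "e2", "e3"]
```

The point of the indirection is that the heavy data lives in cheap object storage, while Kafka carries only a trickle of tiny path messages.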
27. How do we use the new RDR “streaming” infrastructure?
1. Read the files’ paths from the designated Kafka topic
2. Read the RDR files from S3
3. Write files
4. Load data into the OLAP store
28. Did we solve the aforementioned problems?
● EMR clusters are now transient - no more idle clusters
Application type            Day 1      Day 2      Day 3
Old Spark Streaming app     $1007.68   $1007.68   $1007.68
“Streaming” over RDR app    $150.08    $198.73    $174.68
29. Did we solve the aforementioned problems? (cont.)
● No more extreme load on the Kafka brokers’ disks
○ We still read old messages from Kafka, but now we only read about 1K messages per hour (rather than ~300M)
● The new infra doesn’t depend on the integration of Spark Streaming with Kafka
○ No more weird exceptions...
● We can run the Spark batch applications as (in)frequently as we’d like
30. Summary
● Initially replaced standalone Java with Spark & Scala
○ Still faced CSV-related issues
● Introduced Spark Streaming & Kafka for “stateless” use-cases
○ Quickly needed to handle stateful use-cases as well
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Required us to manage the state on our own
● Moved to Structured Streaming (for all use-cases)
○ Cons were mostly related to maturity
31. Summary (cont.)
● Went back to Spark Streaming (with Druid as OLAP)
○ Performance penalty in Kafka for long micro-batches
○ Under-utilized Spark clusters
○ Etc.
● Introduced “streaming” over our Data Lake
○ Eliminated Kafka performance penalty
○ Spark clusters are much better utilized = $$$ saved
○ And more...
32. DRUID ES
Want to know more?
● Women in Big Data
○ A world-wide program that aims :
■ To inspire, connect, grow, and champion the success of women in Big Data
■ To grow women’s representation in the Big Data field to over 25% by 2020
○ Visit the website (https://www.womeninbigdata.org/)
● Counting Unique Users in Real-Time: Here’s a Challenge for You!
○ Presented yesterday, http://tinyurl.com/yxjc72af
● NMC Tech Blog - https://medium.com/nmc-techblog
36. Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
A data stream is treated as an unbounded table:
new data in the data stream = new rows appended to the unbounded table
38. Structured Streaming - WordCount example
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
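In the spirit of the linked WordCount example, the unbounded-table model can be mimicked in plain Python. This is a conceptual simulation, not the Spark API: each trigger appends new rows to the input table, and the result table is updated incrementally by processing only the new rows.

```python
from collections import Counter

input_table = []          # the conceptually unbounded input table
result_table = Counter()  # the continuously updated result table

def trigger(new_lines):
    """One trigger interval: append new rows, update the result incrementally."""
    input_table.extend(new_lines)
    for line in new_lines:             # only the *new* rows are processed
        result_table.update(line.split())

trigger(["apache spark", "apache kafka"])
trigger(["spark streaming"])
assert result_table["apache"] == 2 and result_table["spark"] == 2
```

This is the "incremental query" idea: the second trigger never re-reads the first trigger's lines, yet the result table reflects the whole stream.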
39. Structured Streaming - basic terms
● Input sources :
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes :
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table
● Output sinks :
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes
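The three output modes differ only in which rows of the result table reach the sink on each trigger. A sketch of the semantics (a simulation, not the Spark API; the function and parameter names are invented): Append emits only rows that will never change again, Complete emits the whole result table, Update emits only the rows changed since the last trigger.

```python
def rows_to_emit(mode, result_table, changed_keys, finalized_keys):
    """Which rows of the result table reach the sink for a given output mode."""
    if mode == "complete":
        return dict(result_table)                              # everything
    if mode == "update":
        return {k: result_table[k] for k in changed_keys}      # changed rows only
    if mode == "append":
        return {k: result_table[k] for k in finalized_keys}    # immutable rows only
    raise ValueError(mode)

result = {"a": 3, "b": 1}
assert rows_to_emit("complete", result, {"a"}, set()) == {"a": 3, "b": 1}
assert rows_to_emit("update", result, {"a"}, set()) == {"a": 3}
assert rows_to_emit("append", result, {"a"}, set()) == {}  # nothing finalized yet
```

This also shows why mode and query type must match: an aggregation row only becomes "finalized" once its window can no longer receive late data.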
40. Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means :
○ Trackable sources (i.e offsets)
○ Checkpointing
○ Idempotent sinks
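A minimal sketch of how the three ingredients combine into end-to-end exactly-once semantics (a simulation; Spark persists the checkpoint to a fault-tolerant store such as S3 or HDFS): on restart, processing resumes from the checkpointed offset, and because the sink is idempotent, a replayed batch leaves no duplicates.

```python
source = ["m0", "m1", "m2"]       # trackable source: position = offset
checkpoint = {"offset": 0}        # persisted after each batch
sink = {}                         # idempotent sink: writes keyed by offset

def process_one_batch():
    off = checkpoint["offset"]
    if off >= len(source):
        return
    sink[off] = source[off].upper()   # re-writing the same key is harmless
    checkpoint["offset"] = off + 1    # commit progress *after* the write

process_one_batch()                   # processes m0
saved = dict(checkpoint)
process_one_batch()                   # processes m1
checkpoint.update(saved)              # simulate a crash losing the last commit
process_one_batch()                   # m1 is replayed...
assert sink == {0: "M0", 1: "M1"}     # ...but the sink holds no duplicates
process_one_batch()                   # continues with m2
```

Drop any one ingredient and the guarantee degrades: a non-idempotent sink turns the replay into a duplicate, and without checkpointed offsets the replay cannot be bounded at all.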
42. Structured Streaming in production
So we started moving to Structured Streaming
Use case: Existing Spark app
  Previous architecture: Periodic Spark batch job
  Old flow: Read Parquet from S3 -> Transform -> Write Parquet to S3
  New architecture: Stateless Structured Streaming
  New flow: Read from Kafka -> Transform -> Write Parquet to S3

Use case: Existing Java app
  Previous architecture: Periodic standalone Java process (“manual” scaling)
  Old flow: Read CSV -> Transform and aggregate -> Write to RDBMS
  New architecture: Stateful Structured Streaming
  New flow: Read from Kafka -> Transform and aggregate -> Write to RDBMS

Use case: New app
  Previous architecture: N/A
  Old flow: N/A
  New architecture: Stateful Structured Streaming
  New flow: Read from Kafka -> Transform and aggregate -> Write to RDBMS
Editor's Notes
Thank you for coming to hear about our different use-cases of Streaming with Spark and Kafka
I will try to make it interesting and valuable for you
Questions - at the end of the session
Nielsen marketing cloud or NMC in short
A group inside Nielsen,
Born from eXelate, a company that was acquired by Nielsen in March 2015
Nielsen is a data company and so are we; we had a strong business relationship until at some point they decided to go for it and acquired eXelate
Data company meaning
Buying and onboarding data into NMC from data providers, customers and Nielsen data
We have huge high quality dataset
enrich the data using machine learning models in order to create more relevant quality insights
categorize and sell according to a need
Helping brands to make intelligent business decisions
E.g. Targeting in the digital marketing world
Meaning help fit ads to viewers
For example, a street sign can fit only a very small % of the people who see it, vs.
online ads that can fit the profile of the individual who sees them
More interesting to the user
A higher chance they will click the ad
Better ROI for the marketer
What are the questions we try to answer in NMC that help our customers to take business decisions ?
A lot of questions, but these lead up to what Druid is meant to solve
Translating from human problem to technical problem:
UU (distinct) count
Simple count
Few words on NMC data pipeline architecture:
Frontend layer:
Receives all the online and offline data traffic
Bare metal on different data centers (3 in US, 2 in EU ,3 in APAC)
near real time - high throughput/low latency challenges
Backend layer
Aws Cloud based
process all the frontend layer outputs
ETL’s - load data to data sources aggregated and raw
Applications layer
Also in the cloud
Variety of apps above all our data sources
Web - NMC
data configurations (segments, audiences etc)
campaign analysis , campaign management tools etc.
visualized profile graphs
reports
We’ve used Clustrix (our operation DB) as both OLTP and OLAP
Events are flowing from our Serving system, need to ETL the data into our data stores (DB, DWH, etc.)
Events were written to CSV files
Some fields had double quotes, e.g: 2014-07-17,12:55:38,2,2,0,"1619691,9995",1
Processing was done via standalone Java process
Had many problems with this architecture
Truncated lines in input files
Can’t enforce schema
Had to “manually” scale the processes
Around 2014 the standalone Java processes were transformed into Spark batch jobs written in Scala (but in this presentation we’re going to focus on streaming).
This is a simplified version of what we built (simplified it to make it clearer across the presentation)
Spark
A distributed, scalable engine for large-scale data processing
Unified framework for batch, streaming, machine learning, etc
Was gaining a lot of popularity in the Big Data community
Built on RDDs (Resilient distributed dataset)
A fault-tolerant collection of elements that can be operated on in parallel
Scala
Combines object-oriented and functional programming
A first-class citizen in Spark
Kafka
Open-source stream-processing platform
Highly scalable
Publish/Subscribe (A.K.A pub/sub)
Schema enforcement - using Schema Registry and relying on Avro format
Much more
Originally developed by LinkedIn
Graduated from the Apache Incubator in late 2012
Quickly became the de facto standard in the industry
Today commercial development is led by Confluent
Spark Streaming
A natural evolution of our Spark batch job (unified framework – remember?)
Introduced the DStream concept
Continuous stream of data
Represented by a continuous series of RDDs
Works in micro-batches
Each RDD in a DStream contains data from a certain interval (e.g 5 minutes)
We started with Spark Streaming over Kafka (in 2015)
Our Streaming apps were “stateless” (see below) and running 24/7 :
Reading a batch of messages from Kafka
Performing simple transformations on each message (no aggregations)
Writing the output of each batch to a persistent storage (DB, S3, etc.)
Stateful operations (aggregations) were performed periodically in batch either by
Spark jobs
ETLs in our DB/DWH
Looking back, Spark Streaming might have been able to perform stateful operations for us, but (as far as I recall) mapWithState wasn’t available yet, and updateStateByKey had some pending issues.
The way to achieve it was :
Read messages from Kafka
Aggregate the messages of the current micro-batch
Increased micro-batch length to achieve a better aggregation ratio
Combine the results with the results of the previous micro-batches (stored on the cluster’s HDFS)
Write the results back to HDFS
Every X batches :
Update the DB with the aggregated data (some sort of UPSERT)
Delete the aggregated files from HDFS
UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL)
For example, given t1 with columns a (the key) and b (starting from an empty table)
INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2
INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
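The same UPSERT semantics can be reproduced with Python's built-in sqlite3 module; SQLite spells it ON CONFLICT ... DO UPDATE rather than MySQL's ON DUPLICATE KEY UPDATE, but the accumulate-on-key behavior is identical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (a INTEGER PRIMARY KEY, b INTEGER)")

# SQLite's equivalent of INSERT ... ON DUPLICATE KEY UPDATE b = b + VALUES(b)
upsert = ("INSERT INTO t1 (a, b) VALUES (?, ?) "
          "ON CONFLICT(a) DO UPDATE SET b = b + excluded.b")

con.execute(upsert, (1, 2))   # new key      -> a=1, b=2
con.execute(upsert, (1, 5))   # existing key -> a=1, b=7

assert con.execute("SELECT a, b FROM t1").fetchall() == [(1, 7)]
```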
In this specific use-case, the app was reading from a topic which had only small amounts of data
Required us to manage the state on our own
Error-prone
E.g what if my cluster is terminated and data on HDFS is lost?
Complicates the code
Mixed input sources for the same app (Kafka + files)
Possible performance impact
Might cause the Kafka consumer to lag
Obviously not the perfect way (but that’s what we had…)
DataFrame/Dataset - rather than DStream’s RDD
Catalyst Optimizer - extensible query optimizer which is “at the core of Spark SQL… designed with these key two purposes:
Easily add new optimization techniques and features to Spark SQL
Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.)” (see https://databricks.com/glossary/catalyst-optimizer)
Other features included :
Handling event-time and late data
End-to-end exactly-once fault-tolerance
Checkpoint folder is the location Spark stores :
The offsets we already read from Kafka
The state of the stateful operations (e.g aggregations)
We’ve used S3 (via EMRFS) for checkpointing.
We’ve deployed to production various use-cases using Structured Streaming :
Periodic Spark batch job was converted to a stateless Structured Streaming app
Periodic standalone Java app was converted to a stateful Structured Streaming app
A brand new app was written as a stateful Structured Streaming app
EMRFS consistent view - an optional feature on AWS EMR, allows clusters to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS
Checkpointing to S3 wasn’t straight-forward
Try using EMRFS consistent view
Recommended for stateless apps
For stateful apps, we encountered sporadic issues possibly related to the metadata store (i.e DynamoDB)
Strengths :
Running incremental, continuous processing
End-to-end exactly-once fault-tolerance (if you implement it correctly)
Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation)
Massive efforts are invested in it
Weaknesses :
Maturity
Inability to perform multiple actions on the exact same Dataset
E.g http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-Avoiding-multiple-streaming-queries-tt30944.html
Seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in Spark 2.4, but then you get at-least once)
Moved many apps (mostly the ones performing UU counts and “hit” counts) to rely on Druid, which is meant for OLAP, so now :
Spark Streaming app
Runs on a long-lived EMR cluster (cluster is on 24/7)
Performs the “in-batch” aggregation per micro-batch (before writing to S3)
Writes relevant metadata to RDS (e.g S3 path)
This kind of “split” (i.e persisting the Dataset/DataFrame and iterating it a few times) is impossible with Structured Streaming (where every “branch” of processing is a separate query, at least until https://issues.apache.org/jira/browse/SPARK-24565)
M/R ingestion job (loads data into Druid) :
Reads relevant metadata from RDS
performs the final aggregation (before data is loaded into Druid)
Update state in RDS (e.g which files were handled)
Screenshot from Ganglia installed on our AWS EMR cluster running the Spark Streaming app
Remember - longer micro-batches result in a better aggregation ratio
Each such app runs on its own long-lived EMR cluster
Extreme load of Kafka brokers’ disks
Each micro-batch needs to read ~300M messages , Kafka can’t store it all in memory
ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
Forced us to use 1 core per executor to avoid it
https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well)
We wish we could run it even less frequently
Remember - longer micro-batches result in a better aggregation ratio
Each Kafka topic has its own RDR loader, which stores the data in a separate bucket on S3 (partitioned by date)
This means each topic has only 1 consumer (the appropriate RDR loader)
RDR loaders use micro-batches of 4-6 minutes, writing about 100 files per micro-batch, each file is ~0.5GB (to allow efficient read)
Only simple transformations on each message (no aggregations)
Hence no need for long micro-batches
Once the RDR loader writes the files to S3, it also writes the files’ paths to a designated topic in Kafka
How does that work?
Spark batch applications are executed every X hours
On each execution :
A transient EMR cluster is launched
An app consumes the next Y messages from the designated Kafka topic (containing Y paths). Since each such app consumes from Kafka in the standard way, the offsets (i.e. which messages the consumer already read) are committed (and maintained) the same way as for any other Kafka consumer
Then the app reads those Y paths from RDR and processes them
Once done, the EMR cluster is terminated
Applications are now batch rather than streaming
We no longer use Spark Streaming-Kafka integration, but rather Kafka API (from the driver) to read the files’ paths from the designated Kafka topic
Once we got the paths from Kafka, we use the “regular” batch method of reading files, i.e spark.read.parquet
After processing has ended, offsets of the messages we read are committed (as we’d do for any Kafka consumer)
We now use Airflow (the de facto standard in the industry) to schedule and monitor our batch jobs
All this obviously is not meant to be used by apps that require actual real time (say milliseconds)
EMR clusters are now transient, so the cluster is terminated as soon as the batch job has finished - no more idle clusters
Cost:
Spark Streaming cluster is on 24/7, so the cost is fixed
With the new infra, the daily cost varies based on the amount of data we processed that day
Initially replaced standalone Java with Spark & Scala
Solved the scale-related issues but not the CSV-related issues
Introduced Spark Streaming & Kafka for “stateless” use-cases
Replaced CSV files with Kafka (de facto standard in the industry)
Already had Spark batch in production (Spark as a unified framework)
Tried Spark Streaming for stateful use-cases (via “local” aggregations)
Not the optimal solution
Moved to Structured Streaming (for all use-cases)
Pros include :
Enables running continuous, incremental processes
Built on Spark SQL
Cons include :
Maturity
Inability to perform multiple actions on the exact same Dataset
Went back to Spark Streaming
Aggregations are done per micro-batch (in Spark) and daily (in Druid)
Still not perfect
Performance penalty in Kafka for long micro-batches
Concurrency issue with Kafka 0.10 consumer in Spark
Under-utilized Spark clusters
Introduced “streaming” over our Data Lake
Spark Streaming apps (A.K.A “RDR loaders”) write files to S3 and paths to Kafka
Spark batch apps read S3 paths from Kafka (and the actual files from S3)
Transient EMR clusters
Airflow for scheduling and monitoring
Pros :
Eliminated the performance penalty we had in Kafka
Spark clusters are much better utilized = $$$ saved
“The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended… You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table”
“A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.”