Going Real-Time: Creating Frequently-Updating Datasets for Personalization: Spark Summit East talk by Shriya Arora

•

6 likes•1,864 views

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles. Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) Ease of development for the team (already familiar with spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from spark batch jobs, and 4) Spark support from infrastructure teams within the company. In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited, to the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink , the impact that we had on our customers, and most importantly, the challenges we faced. Take-aways for the audience: 1) A great example of stream processing large, personalization datasets at scale. 2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully. 3) Exposure to some of the technical challenges that should be expected along the way.

Shriya Arora
SeniorData Engineer
Personalization Analytics
Streaming datasets for
Personalization

What is Netflix’s Mission?
Entertaining you by allowing you to stream
content anywhere, anytime

What is Netflix’s Mission?
Entertaining you by allowing you to stream personalized
content anywhere, anytime

How much data do we process to have a personalized Netflix for everyone?
● 93M+ active members
● 125M hours/ day
● 190 countries with
unique catalogs
● 450B unique events/day
● 600+ Kafka topics
Image credit:http://www.bigwisdom.net/

Data Infrastructure
Rawdata
(S3/hdfs)
Stream
Processing
(Spark, Flink …)
Processed data
(Tables/Indexers)
Batch processing
(Spark/Pig/Hive/MR)
Application instances
Keystone
Ingestion
Pipeline
Rohan’stalk
tomorrow
@12:20

User watches a
video on Netflix
Data flows through
Netflix Servers
Our problem: Using user plays for feature generation, discovery, clustering ..

Why have data later when you can have it now?
● Business Wins
○ Algorithms can be trained with the latest + greatest data
○ Enhances research
○ Creates opportunity for new types of algorithms
● Technical Wins
○ Save on storage costs
○ Avoid long running jobs

Source-of-Discovery pipeline
Playback
sessions
Spark- Streaming app
Discovery
service
REST API
Video Metadata
Backup

Spark Streaming
● Needs a StreamingContext and a batch duration
● Data received in DStreams, which are easily converted to RDDs
● Support all fundamental RDD transformations and operations
● Time-based windowing
● Checkpointing support for resilience to failures
● Deployment

Performance tuning your Spark streaming application

Performance tuning your Spark streaming application
● Choice of micro-batch interval
○ The most important parameter
● Cluster memory
○ Large batch intervals need more memory
● Parallelism
○ DStreams naturally partitioned to Kafka partitions
○ Repartition can help with increased parallelism at the cost of shuffle
● # of CPUs
○ <= number of tasks
○ Depends on howcomputationally intensive your processing is

Challenges with Spark
● Not a ‘pure’ event streaming system
○ Minimum latency of batchinterval
○ Un-intuitive to design for a stream-only world
● Choice of batch interval is a little too critical
○ Everythingcan go wrong, if you choose this wrong
○ Build-up of schedulingdelaycan leadto data loss
● Only time-based windowing*
○ Cannot be used to solve session-stitching use cases, or trigger based event aggregations
* I used 1.6.1

Challenges with Streaming
● Pioneer Tax
○ Training ML models on streaming data
is new ground
● Increased criticality of outages
○ Batch failures have to be addressed
urgently, Streaming failureshave to be
addressed immediately.
● Infrastructure investment
○ Monitoring, Alerts,
○ Non-trivial deployments
There are two kinds
of pain...

Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered and written out in real-time with an ease matching that of batch processing. However the real challenge of matching the prowess of batch ETL remains in doing joins, in maintaining state and to have the data be paused or rested dynamically. Netflix has a microservices architecture. Different microservices serve and record different kind of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when we want to combine the events coming from one high-traffic microservice to another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations. Historically we have done this joining of large volume data-sets in batch. However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real time? Why wait a full day to get information from an event that was generated a few mins ago? In this talk, we will share how we solved a complex join of two high-volume event streams using Flink. We will talk about maintaining large state, fault tolerance of a stateful application and strategies for failure recovery.

Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...

Flink Forward

Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.

Introduction to Apache Flink

datamantra

Apache Flink: Real-World Use Cases for Streaming Analytics

Slim Baltagi

This face to face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation (link), marks a new era of Big Data analytics and in particular Real-Time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiples verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications. In this talk, you learn more about: 1. What is Apache Flink Stack? 2. Batch vs. Streaming Analytics 3. Key Differentiators of Apache Flink for Streaming Analytics 4. Real-World Use Cases with Flink for Streaming Analytics 5. Who is using Flink? 6. Where do you go from here?

Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...

Flink Forward

Over 137 million members worldwide are enjoying TV series, feature films across a wide variety of genres and languages on Netflix. It leads to petabyte scale of user behavior data. At Netflix, our client logging platform collects and processes this data to empower recommendations, personalization and many other services to enhance user experience. Built with Apache Flink, this platform processes 100s of billion events and a petabyte data per day, 2.5 million events/sec in sub milliseconds latency. The processing involves a series of data transformations such as decryption and data enrichment of customer, geo, device information using microservices based lookups. The transformed and enriched data is further used by multiple data consumers for a variety of applications such as improving user-experience with A/B tests, tracking application performance metrics, tuning algorithms. This causes redundant reads of the dataset by multiple batch jobs and incurs heavy processing costs. To avoid this, we have developed a config driven, centralized, managed platform, on top of Apache Flink, that reads this data once and routes it to multiple streams based on dynamic configuration. This has resulted in improved computation efficiency, reduced costs and reduced operational overhead. Stream processing at scale while ensuring that the production systems are scalable and cost-efficient brings interesting challenges. In this talk, we will share about how we leverage Apache Flink to achieve this, the challenges we faced and our learnings while running one of the largest Flink application at Netflix.

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...

Databricks

As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

Stream Processing – Concepts and Frameworks

Guido Schmutz

More and more data sources today provide a constant stream of data, from IoT devices to Social Media streams. It is one thing to collect these events in the velocity they arrive, without losing any single message. An Event Hub and a data flow engine can help here. It’s another thing to do some (complex) analytics on the data. There is always the option to first store in a data sink of choice and later analyze it. Storing even a high-volume event stream is feasible and not a challenge anymore. But this adds to the end-to-end latency and it takes minutes if not hours to present results. If you need to react fast, you simply can’t afford to first store the data. You need to do process it directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I will present the important concepts, a Stream Processing solution should support and then dive into some of the most popular frameworks available on the market and how they compare.

Intel dpdk Tutorial

Saifuddin Kaijar

Most databases are based on architectures that pre-date advances to modern hardware. This results in performance issues, the need to overprovision, and a high total cost of ownership. In this webinar we will discuss the advances to modern server technology and take a deep dive into Scylla’s shard-per-core architecture and our asynchronous engine, the Seastar framework. Join us to learn how Seastar (and Scylla): Avoid locks and contention on the CPU level Bypass kernel bottlenecks Implement its per-core shared-nothing autosharding mechanism Utilize modern storage hardware Leverage NUMA to get the best RAM performance Balance your data across CPUs and nodes for best and smoothest performance Plus we’ll cover the advantages of unlocking vertical scalability.

Déploiement ELK en conditions réelles

Geoffroy Arnoud

Splunk Data Onboarding Overview - Splunk Data Collection Architecture

Splunk

Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters

Databricks

Adding nodes at runtime (Upscale) to already running Spark-on-Yarn clusters is fairly easy. But taking away these nodes (Downscale) when the workload is low at some later point of time is a difficult problem. To remove a node from a running cluster, we need to make sure that it is not used for compute as well as storage. But on production workloads, we see that many of the nodes can’t be taken away because: Nodes are running some containers although they are not fully utilized i.e., containers are fragmented on different nodes. Example. – each node is running 1-2 containers/executors although they have resources to run 4 containers. Nodes have some shuffle data in the local disk which will be consumed by Spark application running on this cluster later. In this case, the Resource Manager will never decide to reclaim these nodes because losing shuffle data could lead to costly recomputation of stages. In this talk, we will talk about how we can improve downscaling in Spark-on-YARN clusters under the presence of such constraints. We will cover changes in scheduling strategy for container allocation in YARN and Spark task scheduler which together helps us achieve better packing of containers. This makes sure that containers are defragmented on fewer set of nodes and thus some nodes don’t have any compute. In addition to this, we will also cover enhancements to Spark driver and External Shuffle Service (ESS) which helps us to proactively delete shuffle data which we already know has been consumed. This makes sure that nodes are not holding any unnecessary shuffle data – thus freeing them from storage and hence available for reclamation for faster downscaling.

Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi

DataWorks Summit

Apache NiFi provided a revolutionary data flow management system with a broad range of integrations with existing data production, consumption, and analysis ecosystems, all covered with robust data delivery and provenance infrastructure. Now learn about the follow-on project which expands the reach of NiFi to the edge, Apache MiNiFi. MiNiFi is a lightweight application which can be deployed on hardware orders of magnitude smaller and less powerful than the existing standard data collection platforms. With both a JVM compatible and native agent, MiNiFi allows data collection in brand new environments — sensors with tiny footprints, distributed systems with intermittent or restricted bandwidth, and even disposable or ephemeral hardware. Not only can this data be prioritized and have some initial analysis performed at the edge, it can be encrypted and secured immediately. Local governance and regulatory policies can be applied across geopolitical boundaries to conform with legal requirements. And all of this configuration can be done from central command & control using an existing NiFi with the trusted and stable UI data flow managers already love. Expected prior knowledge / intended audience: developers and data flow managers should have passing knowledge of Apache NiFi as a platform for routing, transforming, and delivering data through systems (a brief overview will be provided). The talk will focus on extending the data collection, routing, provenance, and governance capabilities of NiFi to IoT/edge integration via MiNiFi. Speaker Andy LoPresto, Sr. Member of Technical Staff, Hortonworks

How netflix manages petabyte scale apache cassandra in the cloud

Vinay Kumar Chella

At Netflix, we manage petabytes of data in Apache Cassandra which must be reliably accessible to users in mere milliseconds. To achieve this, we have built sophisticated control planes that turn our persistence layer based on Apache Cassandra into a truly self-driving system. We will start with the user interface that Netflix developers use to interact with their Cassandra databases and dive deep into the automation that powers it all. From cluster creation, through scaling up, to cluster death, complex automation drives large fleets of virtual machines hosted on the AWS cloud. First, we will cover the basics of how Netflix deploys Apache Cassandra. In particular, this begins with how we mold Apache Cassandra to the Netflix philosophy of immutable infrastructure, including managing software and hardware upgrades in the face of ever-failing hardware. Then we will explore the concrete techniques needed for such a massive deployment, specifically pull-based control planes and auto-healing strategies. Next, we will cover how Netflix has automated complex but critical Apache Cassandra maintenance tasks such as continuous snapshot backups and always-on anti-entropy repair for keeping our datasets safe and consistent. Both of these systems have gone through multiple architectural evolutions, and we have learned many lessons along the way. Lastly, we will share some of the ways this has gone wrong, and what you can do to avoid them. We will cover a few case studies of major Cassandra outages at Netflix, their root cause, and what we learned from those incidents. At the end of this talk, we hope that participants leave with concrete understanding of the challenges in running massive scale Apache Cassandra as well as solid advice and techniques for building their own self-driving data persistence layer.

[OpenStack 스터디] OpenStack With Contrail

OpenStack Korea Community

Running Apache NiFi with Apache Spark : Integration Options

Timothy Spann

Near real-time statistical modeling and anomaly detection using Flink!

Flink Forward

Flink Forward San Francisco 2022. At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have a potential to cause serious impact to customer applications and user experience. Automatic detection of such events at lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low latency data pipelines in production using Flink we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them on incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElasticCache and DynamoDB to process events at scale! by Kunal Umrigar & Balint Kurnasz

Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique

ScyllaDB

Optimizing for performance and reducing latency is a hard problem. Examples could be: choosing a different algorithm and data structures, improving SQL queries, adding a cache, serving requests asynchronously, or some low-level optimization that requires a deep understanding of the OS, kernel, compiler, or the network stack. The engineering effort is usually nontrivial, and only if you're lucky, you'll see some tangible results. That being said, there are some performance optimization techniques, with a few lines of code — even exist in the built-in library — it can lead to noticeable surprising results. One of these techniques is to "fail fast, retry soon". These techniques are often neglected or taken for granted. In distributed systems, a service or a database consists of a fleet of nodes that functions as one unit. It is not uncommon for some nodes to go down, usually, for a short time. When this occurs, failures can happen on the client-side and can lead to an outage. To build resilient systems, and reduce the probability of failure, we're going to explore these topics: timeouts, backoff, and jitter. We'll talk about timeouts, what timeout to set, pitfalls of retries, how backoff improves resource utilization, and jitters reduce congestion. Furthermore, we're going to see an adaptive mechanism to dynamically adjust these configurations. This is inspired by a real-production use case where DynamoDB latency p99 & max went down from > 10s to ~500ms after employing these three techniques: timeouts, backoff, and jitter. This is inspired by a real-production use case where DynamoDB latency p99 & max went down from > 10s to ~500ms. AWS articles, specifically M. Brooker’s writings, and SDKs code have been great resources to dive into these techniques: - Timeouts, retries and backoff with jitter in the AWS Builder's Library, 2019 (https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) - Exponential Backoff and Jitter on the AWS Architecture Blog, 2016 (https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/) - Fixing retries with token buckets and circuit breakers, Marc's Blog, 2022 (https://brooker.co.za/blog/2022/02/28/retries.html)

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...

Databricks

Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner. In this talk, I am going examine a number common streaming design patterns in the context of the following questions. WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements? WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements? HOW are going to architect the solution? And how much are you willing to pay for it? Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.

Apache Spark in Depth: Core Concepts, Architecture & Internals

Anton Kirillov

The basics of fluentd

Treasure Data, Inc.

Processing Semantically-Ordered Streams in Financial Services

Flink Forward

Flink Forward San Francisco 2022. What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases. by Patrick Lucas

Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot

Altinity Ltd

Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot While the demands for real-time analytics are growing in leaps and bounds, the analytics software must rely on streaming platforms for ingesting high volumes of data that's traveling in lightning speed down the pipeline. We will take a look at 2 powerful open source Apache platforms: Pulsar and Pinot, that work hand-in-hand together to deliver the analytical results which bring great value to your systems. Presenters: Mary Grygleski - Streaming Developer Advocate & Mark Needham - Developer Relations Engineer at StarTree Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).

Cloud computing and OpenStack

Edgar Magana

Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf

Ververica

The need to enrich a fast, high volume data stream with slow-changing reference data is probably one of the most wide-spread requirements in stream processing applications. Apache Flink's built-in join functionalities and its flexible lower-level APIs support stream enrichment in various ways depending on the specific requirements of the use case at hand. In this webinar, I like to provide an overview of the basic methods to enrich a data stream with Apache Flink and highlight use cases, limitations, advantages and disadvantages of each.

Pulsar Storage on BookKeeper _Seamless Evolution

StreamNative

Apache Pulsar has a distinct architecture from other messaging systems. There is a clear separation of the compute layer that does message processing and dispatching, from the storage layer that handles persistent message storage, using Apache Bookkeeper. This separation of concerns leads to a very efficient design, in terms of performance and cost. Messaging systems that provide guaranteed delivery, when used in production use cases, impose on the underlying storage, demands that are very different from simple benchmark scenarios that test write throughput. Pulsar, with both I/O isolation and separation of concerns, performs better than other messaging systems in production use cases. The strategy of I/O isolation provides better performance from each storage node at less cost, and the separation between computing and storage means that compute nodes can be scaled independently from storage. Irrespective of the choice of storage, Pulsar can be configured to get the best performance for any of those storage configurations. This paper also discusses how some of the latest technologies like NVMe and Persistent Memory can be leveraged at a very low cost overhead, by Pulsar, without any architectural or design changes, with some data from real use cases. The fundamental choice of using Bookkeeper as the storage layer for Pulsar is validated from our experience.

Unified Stream and Batch Processing with Apache Flink

DataWorks Summit/Hadoop Summit

What is Apache Kafka and What is an Event Streaming Platform?

confluent

Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent Streaming platforms have emerged as a popular, new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is and some of the use cases and design patterns around its use—including several examples of where it is solving real business problems. New developments in this area such as KSQL will also be discussed.

Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...

Spark Summit

Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)

Spark Summit

What's hot

Under the Hood of a Shard-per-Core Database Architecture

ScyllaDB

Déploiement ELK en conditions réelles

Geoffroy Arnoud

Splunk Data Onboarding Overview - Splunk Data Collection Architecture

Splunk

Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters

Databricks

Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi

DataWorks Summit

How netflix manages petabyte scale apache cassandra in the cloud

Vinay Kumar Chella

[OpenStack 스터디] OpenStack With Contrail

OpenStack Korea Community

Running Apache NiFi with Apache Spark : Integration Options

Timothy Spann

Near real-time statistical modeling and anomaly detection using Flink!

Flink Forward

Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique

ScyllaDB

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...

Databricks

Apache Spark in Depth: Core Concepts, Architecture & Internals

Anton Kirillov

The basics of fluentd

Treasure Data, Inc.

Processing Semantically-Ordered Streams in Financial Services

Flink Forward

Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot

Altinity Ltd

Cloud computing and OpenStack

Edgar Magana

Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf

Ververica

Pulsar Storage on BookKeeper _Seamless Evolution

StreamNative

Unified Stream and Batch Processing with Apache Flink

DataWorks Summit/Hadoop Summit

What is Apache Kafka and What is an Event Streaming Platform?

confluent

What's hot (20)

Under the Hood of a Shard-per-Core Database Architecture

Déploiement ELK en conditions réelles

Splunk Data Onboarding Overview - Splunk Data Collection Architecture

Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters

Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi

How netflix manages petabyte scale apache cassandra in the cloud

[OpenStack 스터디] OpenStack With Contrail

Running Apache NiFi with Apache Spark : Integration Options

Near real-time statistical modeling and anomaly detection using Flink!

Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...

Apache Spark in Depth: Core Concepts, Architecture & Internals

The basics of fluentd

Processing Semantically-Ordered Streams in Financial Services

Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot

Cloud computing and OpenStack

Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf

Pulsar Storage on BookKeeper _Seamless Evolution

Unified Stream and Batch Processing with Apache Flink

What is Apache Kafka and What is an Event Streaming Platform?

Viewers also liked

Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...

Spark Summit

Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)

Spark Summit

Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...

Spark Summit

Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark. Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. At BlueData, we’re “all in” on Docker containers – with a specific focus on Spark applications. We’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker. In this session, you’ll learn about networking Docker containers across multiple hosts securely. We’ll discuss ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So we’ll discuss some of the storage options we explored and implemented at BlueData to achieve near bare-metal I/O performance for Spark using Docker. We’ll share our lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...

Spark Summit

Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search. Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.

Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...

Spark Summit

We all dread “Lost task” and “Container killed by YARN for exceeding memory limits” messages in our scaled-up spark yarn applications. Even answering the question “How much memory did my application use?” is surprisingly tricky in the distributed yarn environment. Sqrrl has developed a testing framework for observing vital statistics of spark jobs including executor-by-executor memory and CPU usage over time for both the JDK and python portions of pyspark yarn containers. This talk will detail the methods we use to collect, store, and report spark yarn resource usage. This information has proved to be invaluable for performance and regression testing of the spark jobs in Sqrrl Enterprise.

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...

Spark Summit

So you know you want to write a streaming app but any non-trivial streaming app developer would have to think about these questions: How do I manage offsets? How do I manage state? How do I make my spark streaming job resilient to failures? Can I avoid some failures? How do I gracefully shutdown my streaming job? How do I monitor and manage (e.g. re-try logic) streaming job? How can I better manage the DAG in my streaming job? When to use checkpointing and for what? When not to use checkpointing? Do I need a WAL when using streaming data source? Why? When don’t I need one? In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.

Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...

Spark Summit

A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...

Spark Summit

Legacy enterprise data warehouse (EDW) architecture, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for range of use cases, including IOT predictive maintenance.

Tuning and Monitoring Deep Learning on Apache Spark

Databricks

Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs. Speaker: Tim Hunter This talk was originally presented at Spark Summit East 2017.

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...

Spark Summit

Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. If your data is already in one of Spark Streaming’s well-supported message queuing systems, this is easy. If not, an ad hoc solution to import data may work for a single application, but trying to scale that approach to complex data pipelines integrating dozens of data sources and sinks with multi-stage processing quickly breaks down. Spark Streaming solves the realtime data processing problem, but to build large scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier. This talk will first describe some data pipeline anti-patterns we have observed and motivate the need for a tool designed specifically to bridge the gap between other data systems and stream processing frameworks. We will introduce Kafka Connect, starting with basic usage, its data model, and how a variety of systems can map to this model. Next, we’ll explain how building a tool specifically designed around Kafka allows for stronger guarantees, better scalability, and simpler operationalization compared to other general purpose data copying tools. Finally, we’ll describe how combining Kafka Connect and Spark Streaming, and the resulting separation of concerns, allows you to manage the complexity of building, maintaining, and monitoring large scale data pipelines.

Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...

Spark Summit

Apache Spark 2.1.0 boosted the performance of Apache Spark SQL due to Project Tungsten software improvements. Another 16x times faster has been achieved by using Oracle’s innovations for Apache Spark SQL. This 16x improvement is made possible by using Oracle’s Software in Silicon accelerator offload technologies. Apache Spark SQL In-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition on-prem and cloud servers are getting larger physical memory to enable storing these huge workloads be stored in memory. In this talk we will look at using Spark SQL in feature creation, feature generation within pipelines for Spark ML. This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestion to support these kinds of workloads on real applications in cloud deployments. In addition ideas for next generation Tungsten project will also be discussed.

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Spark Summit

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created. Given this scale, we utilized Apache Spark to be the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation. With pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logics as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as a distributed framework we build our own algorithms on top of to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.

Viewers also liked (12)