Kafka Lambda architecture with mirroring

This document outlines a master plan for a lambda architecture built on Kafka. Data from a variety of sources is integrated into the Kafka clusters under the topic name "Data", and those clusters are mirrored. The mirrored data feeds two paths: it is written into a Hadoop cluster for batch processing and analytics, and it is processed in real time with Storm/Spark.

Slide 1: Master Plan (Lambda Architecture)
Diagram; recoverable labels:
- Variety Sources
- REST/IS
- Kafka Cluster
- Mirroring
- Kafka Cluster (Mirror)
- Speed Layer
- Storm/Spark (real-time processing)
- Hadoop Client for Camus
- Hadoop Cluster (shown twice)
- Batch Layer Analytics
- ELT
- Hive/Pig
- DWH
- HBase
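
The split between a batch layer and a speed layer named on this slide can be illustrated with a minimal sketch. It is not taken from the slides: the event shape (a `source` field), the counting view, and the function names `batch_view`, `speed_view`, and `query` are all assumptions chosen for illustration. The idea shown is only the general lambda pattern, where a view recomputed over the full dataset is combined at query time with an incremental view over events that arrived since the last batch run.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from the full dataset."""
    return Counter(event["source"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: update the view one event at a time as data arrives."""
    view = Counter()
    for event in recent_events:
        view[event["source"]] += 1
    return view


def query(batch, speed, source):
    """Serving side: batch result plus the real-time delta."""
    return batch[source] + speed[source]


# Hypothetical events; the "source" field is an assumption for illustration.
master = [{"source": "rest"}, {"source": "rest"}, {"source": "is"}]
recent = [{"source": "rest"}]

batch = batch_view(master)
speed = speed_view(recent)
print(query(batch, speed, "rest"))  # 3
```

In a deployment like the one on the slide, the batch view would presumably be produced by jobs on the Hadoop cluster and the speed view by Storm/Spark; the in-memory counters here only stand in for those systems.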

Slide 2: Kafka (Mirroring) Configuration
Diagram; recoverable labels:
- Kafka Cluster A
- REST/IS (shown twice)
- Kafka Cluster B
- Mirroring
- Kafka Cluster C
- Kafka Cluster D
- Topic: "Data" (shown twice)
- Integration of all the sources coming to clusters A and B with topic name "Data" (shown twice)
- Batch Layer (writing the data into the Hadoop cluster)
- Speed Layer (feeding data from Kafka to Storm)

Host Configuration
- Quad-Core AMD Opteron(TM) Processor
- 8GB RAM
- 320GB
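
The mirroring step on this slide can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not the deck's configuration: it assumes clusters A and B are the sources and cluster C is a mirror target, and the `Cluster` class, the `mirror` function, and the per-source `offsets` dictionary are hypothetical names that do not correspond to any Kafka API. Only the topic name "Data" comes from the slide.

```python
from collections import defaultdict


class Cluster:
    """In-memory stand-in for a Kafka cluster: topic name -> list of messages."""

    def __init__(self, name):
        self.name = name
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)


def mirror(sources, target, topic, offsets):
    """Copy not-yet-mirrored messages of `topic` from each source into the target.

    `offsets` records how far each source has been read, so calling this
    repeatedly does not duplicate messages on the target.
    """
    copied = 0
    for src in sources:
        start = offsets.get(src.name, 0)
        pending = src.topics[topic][start:]
        for message in pending:
            target.produce(topic, message)
        offsets[src.name] = start + len(pending)
        copied += len(pending)
    return copied


# Assumption: A and B receive the source data, C is the mirror.
a, b, c = Cluster("A"), Cluster("B"), Cluster("C")
a.produce("Data", "rest-event-1")
b.produce("Data", "is-event-1")

offsets = {}
mirror([a, b], c, "Data", offsets)

a.produce("Data", "rest-event-2")
mirror([a, b], c, "Data", offsets)

print(c.topics["Data"])  # ['rest-event-1', 'is-event-1', 'rest-event-2']
```

Tracking a read position per source is what lets both source clusters be merged into one "Data" topic on the mirror without re-copying earlier messages. A real setup would rely on Kafka's own mirroring tooling and consumer offsets rather than a dictionary.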


