SlideShare a Scribd company logo
BigDataStreams
Architectures
Why?What?How?
Anton Nazaruk
CTO @ VITech+
BigDatain2016+?
BigDatain2016+?
● No more an exotic buzzword
● Mature enough and already adopted by majority of businesses/companies
● Set of well-defined tools and processes… questionable
● Data Analysis at scale - taking value from your data!
○ Prescriptive - reveals what action should be taken
○ Predictive - analysis of likely scenarios of what might happen
○ Diagnostic - past analysis, shows what had happened and why (classic)
○ Descriptive - real time analytics (stocks, healthcare..)
BigDataanalysischallenges
● Integration - ability to have needed data in needed place
● Latency - data have to be presented for processing immediately
● Throughput - ability to consume/process massive volumes of data
● Consistency - data mutation in one place must be reflected everywhere
● Teams collaboration - inconvenient interface for inter-teams
communication
● Technology adoption - typical technologies stack greatly complicates
entire project ecosystem - another world of hiring, deployment, testing,
scaling, fault tolerance, upgrades, monitoring, etc.
It’sachallenge!
Evolutionary
system
Solution
The Event LOG
What every software engineer should know about real-
time data's unifying abstraction
https://engineering.linkedin.com/distributed-systems/log-what-every-
software-engineer-should-know-about-real-time-datas-unifying
TheeventLog
Reference
architecture
transitioncase
Unifiedorderedeventlog
Kafka
● Fast - single Kafka broker can handle hundreds of megabytes of reads
and writes per second from thousands of clients
● Scalable - can be elastically and transparently expanded without
downtime
● Durable - Messages are persisted on disk and replicated within the
cluster to prevent data loss. Each broker can handle terabytes of
messages without performance impact
● Reliable - has a modern cluster-centric design that offers strong
durability and fault-tolerance guarantees
Kafka-highlevelview
Kafka-buildingblocks
● Producer - process that publishes messages to a Kafka
● Topic - a category or feed name to which messages are published. For
each topic, the Kafka cluster maintains a partitioned log
● Partition - part of a topic: level of parallelism in Kafka. Write/Read order
is guaranteed at partition level
Kafka-buildingblocks
● Producer - process that publishes messages to a Kafka
● Topic - a category or feed name to which messages are published. For
each topic, the Kafka cluster maintains a partitioned log
● Partition - part of a topic: level of parallelism in Kafka. Write/Read order
is guaranteed at partition level
● Replica - up-to-date partition’s copy. Each partition is replicated across a
configurable number of servers for fault tolerance (like HDFS block)
Kafka-buildingblocks
● Producer - process that publishes messages to a Kafka
● Topic - a category or feed name to which messages are published. For
each topic, the Kafka cluster maintains a partitioned log
● Partition - part of a topic: level of parallelism in Kafka. Write/Read order
is guaranteed at partition level
● Replica - up-to-date partition’s copy. Each partition is replicated across a
configurable number of servers for fault tolerance (like HDFS block)
● Consumer - process that subscribes to topics and processes published
messages
Kafka-buildingblocks
● Consumer - process that subscribes to topics and processes published
messages
StreamProcessing-highlevel
Apache Storm
Apache Spark
Apache Samza
Apache Flink
Apache Flume
...
StreamProcessing-possibleimplementationframeworks
StreamProcessing-possibleimplementationframeworks
● Pros
○ Automatic fault tolerance
○ Scaling
○ No data loss guarantees
○ Stream processing DSL/SQL (joins, filters, count aggregates, etc)
● Cons
○ Overall system complexity significantly grows
■ New cluster to maintain/monitor/upgrade/etc (Apache Storm)
■ Multi-pattern (mixed) data access (Spark/Samza on YARN)
○ Another framework to learn for your team
StreamProcessing-microservices
StreamProcessing-microservices
Small, independent processes that communicate with each other to form
complex applications which utilize language-agnostic APIs.
These services are small building blocks, highly decoupled and focused on
doing a small task, facilitating a modular approach to system-building.
The microservices architectural style is becoming the standard for building
modern applications.
StreamProcessing-microservicescommunication
Three most commonly used protocols are :
● Synchronous request-response calls (mainly via HTTP REST API)
● Asynchronous (non blocking IO) request-response communication (Akka,
Play Framework, etc)
● Asynchronous messages buffers (RabbitMQ, JMS, ActiveMQ, etc)
StreamProcessing-microservicesplatforms
Microservices deployment platforms :
● Apache Mesos with a framework like Marathon
● Swarm from Docker
● Kubernetes
● YARN with something like Slider
● Various hosted container services such as ECS from Amazon
● Cloud Foundry
● Heroku
StreamProcessing-microservices
Why can’t I just package
and deploy my events
processing code on Yarn /
Mesos / Docker / Amazon
cluster and let it take care o
fault tolerance, scaling and
other weird things?
StreamProcessing-microservices
StreamProcessing-microservicescommunication
Fourth protocol is :
● Asynchronous, ordered and manageable logs of events - Kafka
StreamProcessing-newera(kafka&microservices)
StreamProcessing-kafka
● New Kafka Consumer 0.9.+
○ Light - consumer client is just a thin JAR without heavy 3rd party
dependencies (ZooKeeper, scala runtime, etc)
○ Acts as Load Balancer
○ Fault tolerant
○ Simple to use API
○ Kafka Streams - elegant DSL (should be officially released this
month)
StreamProcessing-kafka&microservices
StreamProcessing-kafka&microservices
1. Language agnostic logs of events (buffers)
2. No backpressure on consumers (API endpoints with sync approach)
3. Fault tolerance - no data loss
4. Failed service doesn’t bring entire chain down
5. Resuming from last committed offset position
6. No circuit breaker like patterns needed
7. Smooth configs management across all nodes and services
StreamProcessing-kafka&microservices
LambdaArchitecture
KappaArchitecture
KappaArchitecture
Architecturescomparison
Lambda Kappa
Processing paradigm Batch + Streaming Streaming
Re-processing paradigm Every batch cycles Only when code changes
Resource consumption Higher Lower
Maintenance/Support
complexity
Higher Lower
Ability to re-create dataset
Per any point of time
No (or very hard) Yes
Evenmoreinterestingcomparison
Hadoop-centric system Kafka-centric system
Data Replication + +
Fault Tolerance + +
Scaling + +
Random Reads With HBase With Elasticsearch/Solr
Ordered Reads - +
Secondary indices With Elasticsearch/Solr With Elasticsearch/Solr
Storage for Big Files (>10M) + -
TCO higher lower
Summary
1. Events Log centric system design - from chaos to structured
architecture
Summary
1. Events Log centric system design - from chaos to structured
architecture
2. Kafka as an Events Log reference storage implementation
Summary
1. Events Log centric system design - from chaos to structured
architecture
2. Kafka as an Events Log reference storage implementation
3. Microservices as distributed events processing approach
Summary
1. Events Log centric system design - from chaos to structured
architecture
2. Kafka as an Events Log reference storage implementation
3. Microservices as distributed events processing approach
4. Kappa Architecture as Microservices & Kafka symbiosis
Usefullinks
1. “I heart Logs” by Jay Krepps http://shop.oreilly.
com/product/0636920034339.do
2. http://confluent.io/blog
3. https://engineering.linkedin.com/distributed-systems/log-what-every-
software-engineer-should-know-about-real-time-datas-unifying
4. “Making sense of stream processing” by Martin Kleppmann
5. http://kafka.apache.org/
6. http://martinfowler.com/articles/microservices.html
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?

More Related Content

What's hot

The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren Shure
Big Data Spain
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...Romeo Kienzler
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
DataWorks Summit
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
boorad
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
Kumari Surabhi
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
Tung Nguyen
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
Anna Shymchenko
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics tools
Nascenia IT
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...
Big Data Spain
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
Dataconomy Media
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dataconomy Media
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
Guido Schmutz
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Denodo
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
Itai Yaffe
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 

What's hot (20)

The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren Shure
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics tools
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 

Viewers also liked

Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016
LibreCon
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
Stavros Kontopoulos
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Ahsan Javed Awan
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
Juantomás García Molina
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid Cloud
Neeraj Sabharwal
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
Carol McDonald
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Gyula Fóra
 
Voldemort : Prototype to Production
Voldemort : Prototype to ProductionVoldemort : Prototype to Production
Voldemort : Prototype to Production
Vinoth Chandar
 
Librecon 2016 bilbao: kappa architecture IoT of the cars
Librecon 2016 bilbao:   kappa architecture IoT of the carsLibrecon 2016 bilbao:   kappa architecture IoT of the cars
Librecon 2016 bilbao: kappa architecture IoT of the cars
Juantomás García Molina
 
High-Performance Analytics in the Cloud with Apache Impala
High-Performance Analytics in the Cloud with Apache ImpalaHigh-Performance Analytics in the Cloud with Apache Impala
High-Performance Analytics in the Cloud with Apache Impala
Cloudera, Inc.
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
Cloudera, Inc.
 
Streaming Analytics - Comparison of Open Source Frameworks and Products
Streaming Analytics - Comparison of Open Source Frameworks and ProductsStreaming Analytics - Comparison of Open Source Frameworks and Products
Streaming Analytics - Comparison of Open Source Frameworks and Products
Kai Wähner
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
Ido Shilon
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Mathieu Dumoulin
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
Ahsan Javed Awan
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
Yahoo Developer Network
 
Big Data Architectures
Big Data ArchitecturesBig Data Architectures
Big Data Architectures
Guido Schmutz
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
Apache Apex
 

Viewers also liked (20)

Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid Cloud
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Voldemort : Prototype to Production
Voldemort : Prototype to ProductionVoldemort : Prototype to Production
Voldemort : Prototype to Production
 
Librecon 2016 bilbao: kappa architecture IoT of the cars
Librecon 2016 bilbao:   kappa architecture IoT of the carsLibrecon 2016 bilbao:   kappa architecture IoT of the cars
Librecon 2016 bilbao: kappa architecture IoT of the cars
 
High-Performance Analytics in the Cloud with Apache Impala
High-Performance Analytics in the Cloud with Apache ImpalaHigh-Performance Analytics in the Cloud with Apache Impala
High-Performance Analytics in the Cloud with Apache Impala
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
 
Streaming Analytics - Comparison of Open Source Frameworks and Products
Streaming Analytics - Comparison of Open Source Frameworks and ProductsStreaming Analytics - Comparison of Open Source Frameworks and Products
Streaming Analytics - Comparison of Open Source Frameworks and Products
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
Big Data Architectures
Big Data ArchitecturesBig Data Architectures
Big Data Architectures
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
 

Similar to Big Data Streams Architectures. Why? What? How?

AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
GeeksLab Odessa
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
DataStax Academy
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
Asis Mohanty
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Captial One: Why Stream Data as Part of Data Transformation?
Captial One: Why Stream Data as Part of Data Transformation?Captial One: Why Stream Data as Part of Data Transformation?
Captial One: Why Stream Data as Part of Data Transformation?
ScyllaDB
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBDDan Frincu
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 
Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?
Micron Technology
 
lessons from managing a pulsar cluster
 lessons from managing a pulsar cluster lessons from managing a pulsar cluster
lessons from managing a pulsar cluster
Shivji Kumar Jha
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
jlacefie
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
confluent
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
Peter Clapham
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
Peter Clapham
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
Ding Li
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
confluent
 
Redpanda and ClickHouse
Redpanda and ClickHouseRedpanda and ClickHouse
Redpanda and ClickHouse
Altinity Ltd
 

Similar to Big Data Streams Architectures. Why? What? How? (20)

AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Captial One: Why Stream Data as Part of Data Transformation?
Captial One: Why Stream Data as Part of Data Transformation?Captial One: Why Stream Data as Part of Data Transformation?
Captial One: Why Stream Data as Part of Data Transformation?
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBD
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?
 
lessons from managing a pulsar cluster
 lessons from managing a pulsar cluster lessons from managing a pulsar cluster
lessons from managing a pulsar cluster
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
 
Redpanda and ClickHouse
Redpanda and ClickHouseRedpanda and ClickHouse
Redpanda and ClickHouse
 

Recently uploaded

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 

Recently uploaded (20)

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 

Big Data Streams Architectures. Why? What? How?

  • 3. BigDatain2016+? ● No more an exotic buzzword ● Mature enough and already adopted by majority of businesses/companies ● Set of well-defined tools and processes… questionable ● Data Analysis at scale - taking value from your data! ○ Prescriptive - reveals what action should be taken ○ Predictive - analysis of likely scenarios of what might happen ○ Diagnostic - past analysis, shows what had happened and why (classic) ○ Descriptive - real time analytics (stocks, healthcare..)
  • 4. BigDataanalysischallenges ● Integration - ability to have needed data in needed place ● Latency - data have to be presented for processing immediately ● Throughput - ability to consume/process massive volumes of data ● Consistency - data mutation in one place must be reflected everywhere ● Teams collaboration - inconvenient interface for inter-teams communication ● Technology adoption - typical technologies stack greatly complicates entire project ecosystem - another world of hiring, deployment, testing, scaling, fault tolerance, upgrades, monitoring, etc.
  • 7. Solution The Event LOG What every software engineer should know about real- time data's unifying abstraction https://engineering.linkedin.com/distributed-systems/log-what-every- software-engineer-should-know-about-real-time-datas-unifying
  • 12. Kafka ● Fast - single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients ● Scalable - can be elastically and transparently expanded without downtime ● Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact ● Reliable - has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees
  • 14. Kafka-buildingblocks ● Producer - process that publishes messages to a Kafka ● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log ● Partition - part of a topic: level of parallelism in Kafka. Write/Read order is guaranteed at partition level
  • 15. Kafka-buildingblocks ● Producer - process that publishes messages to a Kafka ● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log ● Partition - part of a topic: level of parallelism in Kafka. Write/Read order is guaranteed at partition level ● Replica - up-to-date partition’s copy. Each partition is replicated across a configurable number of servers for fault tolerance (like HDFS block)
  • 16. Kafka-buildingblocks ● Producer - process that publishes messages to a Kafka ● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log ● Partition - part of a topic: level of parallelism in Kafka. Write/Read order is guaranteed at partition level ● Replica - up-to-date partition’s copy. Each partition is replicated across a configurable number of servers for fault tolerance (like HDFS block) ● Consumer - process that subscribes to topics and processes published messages
  • 17. Kafka-buildingblocks ● Consumer - process that subscribes to topics and processes published messages
  • 19. Apache Storm Apache Spark Apache Samza Apache Flink Apache Flume ... StreamProcessing-possibleimplementationframeworks
  • 20. StreamProcessing-possibleimplementationframeworks ● Pros ○ Automatic fault tolerance ○ Scaling ○ No data loss guarantees ○ Stream processing DSL/SQL (joins, filters, count aggregates, etc) ● Cons ○ Overall system complexity significantly grows ■ New cluster to maintain/monitor/upgrade/etc (Apache Storm) ■ Multi-pattern (mixed) data access (Spark/Samza on YARN) ○ Another framework to learn for your team
  • 22. StreamProcessing-microservices Small, independent processes that communicate with each other to form complex applications which utilize language-agnostic APIs. These services are small building blocks, highly decoupled and focused on doing a small task, facilitating a modular approach to system-building. The microservices architectural style is becoming the standard for building modern applications.
  • 23. StreamProcessing-microservicescommunication Three most commonly used protocols are : ● Synchronous request-response calls (mainly via HTTP REST API) ● Asynchronous (non blocking IO) request-response communication (Akka, Play Framework, etc) ● Asynchronous messages buffers (RabbitMQ, JMS, ActiveMQ, etc)
  • 24. StreamProcessing-microservicesplatforms Microservices deployment platforms : ● Apache Mesos with a framework like Marathon ● Swarm from Docker ● Kubernetes ● YARN with something like Slider ● Various hosted container services such as ECS from Amazon ● Cloud Foundry ● Heroku
  • 25. StreamProcessing-microservices Why can’t I just package and deploy my events processing code on Yarn / Mesos / Docker / Amazon cluster and let it take care o fault tolerance, scaling and other weird things?
  • 27. StreamProcessing-microservicescommunication Fourth protocol is : ● Asynchronous, ordered and manageable logs of events - Kafka
  • 29. StreamProcessing-kafka ● New Kafka Consumer 0.9.+ ○ Light - consumer client is just a thin JAR without heavy 3rd party dependencies (ZooKeeper, scala runtime, etc) ○ Acts as Load Balancer ○ Fault tolerant ○ Simple to use API ○ Kafka Streams - elegant DSL (should be officially released this month)
  • 31. StreamProcessing-kafka&microservices 1. Language agnostic logs of events (buffers) 2. No backpressure on consumers (API endpoints with sync approach) 3. Fault tolerance - no data loss 4. Failed service doesn’t bring entire chain down 5. Resuming from last committed offset position 6. No circuit breaker like patterns needed 7. Smooth configs management across all nodes and services
  • 36. Architecturescomparison Lambda Kappa Processing paradigm Batch + Streaming Streaming Re-processing paradigm Every batch cycles Only when code changes Resource consumption Higher Lower Maintenance/Support complexity Higher Lower Ability to re-create dataset Per any point of time No (or very hard) Yes
  • 37.
  • 38. Evenmoreinterestingcomparison Hadoop-centric system Kafka-centric system Data Replication + + Fault Tolerance + + Scaling + + Random Reads With HBase With Elasticsearch/Solr Ordered Reads - + Secondary indices With Elasticsearch/Solr With Elasticsearch/Solr Storage for Big Files (>10M) + - TCO higher lower
  • 39. Summary 1. Events Log centric system design - from chaos to structured architecture
  • 40. Summary 1. Events Log centric system design - from chaos to structured architecture 2. Kafka as an Events Log reference storage implementation
  • 41. Summary 1. Events Log centric system design - from chaos to structured architecture 2. Kafka as an Events Log reference storage implementation 3. Microservices as distributed events processing approach
  • 42. Summary 1. Events Log centric system design - from chaos to structured architecture 2. Kafka as an Events Log reference storage implementation 3. Microservices as distributed events processing approach 4. Kappa Architecture as Microservices & Kafka symbiosis
  • 43. Usefullinks 1. “I heart Logs” by Jay Krepps http://shop.oreilly. com/product/0636920034339.do 2. http://confluent.io/blog 3. https://engineering.linkedin.com/distributed-systems/log-what-every- software-engineer-should-know-about-real-time-datas-unifying 4. “Making sense of stream processing” by Martin Kleppmann 5. http://kafka.apache.org/ 6. http://martinfowler.com/articles/microservices.html