Big Data Streams Architectures
Why? What? How?
Anton Nazaruk
CTO @ VITech+
Big Data in 2016+?
● No longer an exotic buzzword
● Mature enough and already adopted by the majority of businesses/companies
● A set of well-defined tools and processes… questionable
● Data analysis at scale - extracting value from your data!
○ Prescriptive - reveals what action should be taken
○ Predictive - analysis of likely scenarios of what might happen
○ Diagnostic - past analysis, shows what happened and why (classic)
○ Descriptive - real-time analytics (stocks, healthcare, etc.)
Big Data analysis challenges
● Integration - ability to have the needed data in the needed place
● Latency - data must be available for processing immediately
● Throughput - ability to consume/process massive volumes of data
● Consistency - a data mutation in one place must be reflected everywhere
● Team collaboration - inconvenient interfaces for inter-team communication
● Technology adoption - a typical technology stack greatly complicates the
entire project ecosystem - another world of hiring, deployment, testing,
scaling, fault tolerance, upgrades, monitoring, etc.
It’s a challenge!
Evolutionary
system
Solution
The Event LOG
What every software engineer should know about real-time data's unifying abstraction
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
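The log abstraction can be illustrated with a minimal sketch. This is not Kafka and not code from the referenced article; the `EventLog` class and its method names are hypothetical, chosen only to show an append-only sequence in which every event gets an offset and each reader decides where to read from.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class EventLog:
    """Append-only, totally ordered sequence of events (illustrative only)."""
    _events: List[Any] = field(default_factory=list)

    def append(self, event: Any) -> int:
        """Append an event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset: int, max_events: int = 100) -> List[Any]:
        """Return up to max_events events starting at the given offset."""
        return self._events[offset:offset + max_events]


log = EventLog()
for event in ({"user": 1, "action": "signup"},
              {"user": 1, "action": "login"},
              {"user": 2, "action": "signup"}):
    log.append(event)

# Two independent subscribers see the same events in the same order,
# each reading at its own pace and without coordinating with the other.
search_index = log.read(0)
cache = log.read(0, max_events=2) + log.read(2)
```

Because both subscribers derive their state from the same ordered sequence, they end up consistent with each other, which is the integration property the slide refers to.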
The Event Log
Reference
architecture
Transition case
Unified ordered event log
Kafka
● Fast - a single Kafka broker can handle hundreds of megabytes of reads
and writes per second from thousands of clients
● Scalable - can be elastically and transparently expanded without
downtime
● Durable - messages are persisted on disk and replicated within the
cluster to prevent data loss. Each broker can handle terabytes of
messages without performance impact
● Reliable - has a modern cluster-centric design that offers strong
durability and fault-tolerance guarantees
Kafka - high-level view
Kafka - building blocks
● Producer - a process that publishes messages to a Kafka topic
● Topic - a category or feed name to which messages are published. For
each topic, the Kafka cluster maintains a partitioned log
● Partition - part of a topic and the unit of parallelism in Kafka. Write/read
order is guaranteed at the partition level
● Replica - an up-to-date copy of a partition. Each partition is replicated across a
configurable number of servers for fault tolerance (like an HDFS block)
● Consumer - a process that subscribes to topics and processes published
messages
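How these building blocks fit together can be sketched in a few lines. This is a toy in-memory model, not the Kafka client API: the `Topic` class, its methods and the CRC32-based key hashing are assumptions made for illustration (real Kafka clients ship their own partitioner, and replication is left out entirely).

```python
import zlib
from typing import List, Tuple


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def partition_for(self, key: str) -> int:
        # Assumption: a stable hash of the key modulo the partition count.
        # This only mimics the idea of key-based partitioning.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        """Producer side: append to the key's partition, return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int = 0) -> List[Tuple[str, str]]:
        """Consumer side: read one partition in order, starting at offset."""
        return self.partitions[partition][offset:]


topic = Topic("page-views", num_partitions=3)
for i in range(6):
    topic.publish(key=f"user-{i % 2}", value=f"view-{i}")

# All events for one key land in the same partition, in publish order.
p0 = topic.partition_for("user-0")
user0_values = [v for k, v in topic.consume(p0) if k == "user-0"]
```

Ordering holds per partition only. Events that must be processed in order relative to each other therefore need to share a key.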
Stream Processing - high level
Stream Processing - possible implementation frameworks
Apache Storm
Apache Spark
Apache Samza
Apache Flink
Apache Flume
...
● Pros
○ Automatic fault tolerance
○ Scaling
○ Guarantees against data loss
○ Stream processing DSL/SQL (joins, filters, count aggregates, etc.)
● Cons
○ Overall system complexity grows significantly
■ New cluster to maintain/monitor/upgrade/etc. (Apache Storm)
■ Multi-pattern (mixed) data access (Spark/Samza on YARN)
○ Another framework for your team to learn
Stream Processing - microservices
Small, independent processes that communicate with each other to form
complex applications which utilize language-agnostic APIs.
These services are small building blocks, highly decoupled and focused on
doing a small task, facilitating a modular approach to system-building.
The microservices architectural style is becoming the standard for building
modern applications.
Stream Processing - microservices communication
The three most commonly used protocols are:
● Synchronous request-response calls (mainly via HTTP REST APIs)
● Asynchronous (non-blocking I/O) request-response communication (Akka,
Play Framework, etc.)
● Asynchronous message buffers (RabbitMQ, JMS, ActiveMQ, etc.)
Stream Processing - microservices platforms
Microservices deployment platforms:
● Apache Mesos with a framework like Marathon
● Swarm from Docker
● Kubernetes
● YARN with something like Slider
● Various hosted container services such as ECS from Amazon
● Cloud Foundry
● Heroku
Stream Processing - microservices
Why can’t I just package and deploy my event-processing code on a
YARN / Mesos / Docker / Amazon cluster and let it take care of
fault tolerance, scaling and other weird things?
Stream Processing - microservices communication
The fourth protocol is:
● Asynchronous, ordered and manageable logs of events - Kafka
Stream Processing - new era (Kafka & microservices)
Stream Processing - Kafka
● New Kafka Consumer (0.9+)
○ Light - the consumer client is just a thin JAR without heavy third-party
dependencies (ZooKeeper, Scala runtime, etc.)
○ Acts as a load balancer
○ Fault tolerant
○ Simple-to-use API
○ Kafka Streams - elegant DSL (should be officially released this month)
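The "acts as a load balancer" and "fault tolerant" points can be pictured as partitions being redistributed among the live members of a consumer group. The sketch below is a simplified stand-in for that rebalance, assuming plain round-robin; the `assign` function and the service names are hypothetical and do not reproduce Kafka's actual assignors.

```python
from typing import Dict, List


def assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over the live consumers round-robin.

    Simplified illustration of a group rebalance, not Kafka's real assignor.
    """
    if not consumers:
        raise ValueError("at least one live consumer is required")
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {c: [] for c in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]

# Three service instances share the six partitions evenly.
before = assign(partitions, ["svc-a", "svc-b", "svc-c"])

# svc-b crashes: the survivors take over its partitions.
after = assign(partitions, ["svc-a", "svc-c"])
```

Each partition is owned by exactly one group member at a time, so adding instances spreads the load and losing one only shifts its partitions to the rest.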
Stream Processing - Kafka & microservices
1. Language-agnostic logs of events (buffers)
2. No backpressure on consumers (API endpoints with sync approach)
3. Fault tolerance - no data loss
4. A failed service doesn’t bring the entire chain down
5. Resuming from the last committed offset position
6. No circuit-breaker-like patterns needed
7. Smooth configuration management across all nodes and services
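Points 3 to 5 can be sketched with a consumer that commits its offset only after an event has been handled. Everything here is an in-memory simulation under stated assumptions: the `run` function, the `committed` dictionary and the "billing" group name are hypothetical, and a real service would use a Kafka client rather than a Python list.

```python
from typing import Callable, Dict, List

log: List[str] = [f"event-{i}" for i in range(6)]
committed: Dict[str, int] = {}   # group id -> next offset to read
processed: List[str] = []


def run(group: str, handler: Callable[[str], None], crash_at: int = -1) -> None:
    """Process the log from the group's last committed offset.

    The offset is committed only after the handler succeeds, which gives
    at-least-once delivery: a crash may cause a re-read, never a skip.
    """
    offset = committed.get(group, 0)
    while offset < len(log):
        if offset == crash_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        handler(log[offset])
        offset += 1
        committed[group] = offset


try:
    run("billing", processed.append, crash_at=3)
except RuntimeError:
    pass  # the service dies; the log and the committed offset survive

# A restarted instance resumes at offset 3 instead of starting over.
run("billing", processed.append)
```

The producer is never blocked by the failed service; events simply wait in the log until a consumer comes back.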
Lambda Architecture
Kappa Architecture
Architectures comparison
                                          Lambda               Kappa
Processing paradigm                       Batch + Streaming    Streaming
Re-processing paradigm                    Every batch cycle    Only when code changes
Resource consumption                      Higher               Lower
Maintenance/support complexity            Higher               Lower
Ability to re-create the dataset
as of any point in time                   No (or very hard)    Yes
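The Kappa column can be illustrated by replaying a retained log through a changed job. This is a minimal sketch, assuming the full event history is still available; the `replay` helper, the two job functions and the sample events are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]
View = Dict[str, float]

events: List[Event] = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": 5.0},
    {"user": "a", "amount": 2.5},
]


def replay(log: List[Event],
           job: Callable[[View, Event], None],
           upto: Optional[int] = None) -> View:
    """Rebuild a derived view by re-reading the retained log from offset 0."""
    view: View = {}
    for event in log[:upto]:
        job(view, event)
    return view


def job_v1(view: View, event: Event) -> None:
    # Original logic: total amount per user.
    view[event["user"]] = view.get(event["user"], 0.0) + event["amount"]


def job_v2(view: View, event: Event) -> None:
    # Changed logic: number of events per user.
    view[event["user"]] = view.get(event["user"], 0) + 1


view_v1 = replay(events, job_v1)
view_v2 = replay(events, job_v2)            # reprocess only because the code changed
view_at_2 = replay(events, job_v1, upto=2)  # state as of the first two events
```

There is a single code path for live processing and reprocessing, so no separate batch layer has to be kept in sync with the streaming one.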
Even more interesting comparison
                                Hadoop-centric system      Kafka-centric system
Data Replication                +                          +
Fault Tolerance                 +                          +
Scaling                         +                          +
Random Reads                    With HBase                 With Elasticsearch/Solr
Ordered Reads                   -                          +
Secondary indices               With Elasticsearch/Solr    With Elasticsearch/Solr
Storage for Big Files (>10M)    +                          -
TCO                             higher                     lower
Summary
1. Event-log-centric system design - from chaos to structured
architecture
2. Kafka as a reference storage implementation of the event log
3. Microservices as a distributed event-processing approach
4. Kappa Architecture as a symbiosis of microservices and Kafka
Useful links
1. “I Heart Logs” by Jay Kreps http://shop.oreilly.com/product/0636920034339.do
2. http://confluent.io/blog
3. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
4. “Making Sense of Stream Processing” by Martin Kleppmann
5. http://kafka.apache.org/
6. http://martinfowler.com/articles/microservices.html