Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Design Patterns For Real Time Streaming Data Analytics
15 Apr 2015
Sheetal Dolas
Principal Architect, Hortonworks
Who am I?
• Principal Architect @ Hortonworks
• Most of my career has been in the field, solving real-life business problems
• Last 5+ years in Big Data, including Hadoop, Storm, etc.
• Co-developed Cisco OpenSOC (http://opensoc.github.io)
sheetal@hortonworks.com
@sheetal_dolas
Agenda
• Streaming Architectural Patterns - Overview
• Design Patterns
o What
o Why
o Illustrations
• Q&A
Streaming Architectural Patterns
Real Time Streaming Architecture
• Source Systems: Syslog, Machine Data, External Streams, Other
• Data Collection: Flume / Custom (Agent A, Agent B, ... Agent N)
• Messaging System: Kafka (Topic A, Topic B, ... Topic N)
• Real Time Processing: Storm (Topology A, Topology B, ... Topology N)
• Storage
o Search: Elastic Search / Solr
o Low Latency NoSQL: HBase
o Historic: Hive / HDFS
• Access: Web Services (REST API), Web Apps, Analytic Tools (R / Python), BI Tools, Alerting Systems
Lambda Architecture
• New Data arrives as a Data Stream
• Batch Layer: All Data → Pre-compute Views
• Speed Layer: Stream Processing → Real Time View
• Serving Layer: Batch Views
• Data Access: Query
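The query step in the Lambda diagram can be sketched as below. This is a minimal illustration, assuming an additive metric (event counts per key) and plain dictionaries standing in for the batch view and the real time view; the function and variable names are hypothetical and not from the talk.

```python
def query(key, batch_view, realtime_view):
    """Answer a query by combining the precomputed batch view with the
    real time view, which covers events the batch layer has not processed yet."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical views: counts per page
batch_view = {"page_a": 100, "page_b": 40}
realtime_view = {"page_a": 7}
total_a = query("page_a", batch_view, realtime_view)
```

Non-additive metrics (for example distinct counts) would need a different merge rule than simple addition.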
Kappa Architecture
• Data Source → Data Stream → Stream Processing System → Serving DB → Data Access (Query)
• Stream Processing System: Job Version n, Job Version n + 1
• Serving DB: Output table n, Output table n + 1
Design Patterns
Design Pattern – What is it?
A general reusable solution to a commonly occurring problem within a given context in software design.
• Reusable Solution
• Commonly Occurring Problem
• Contextual
• Software Design
Design Patterns – Why?
• Streaming use cases have distinct characteristics
o Unpredictable incoming data patterns
o Correlating multiple streams
o Out-of-sequence and late events
• High scale and continuous streams pose new challenges
o Peaks and valleys
o Changing data characteristics over a period of time
o Maintaining the latency and throughput SLAs
Streaming Patterns
• Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
• Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
• Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
• Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication
Streaming Patterns – Being Discussed
• Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
• Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
• Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
• Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication
External Lookup
Dynamic, High Speed Enrichments With External Data Lookup
External Lookup - Description
Referencing frequently changing external system data for event enrichments, filters or validations, while minimizing event processing latency and system bottlenecks and maintaining high throughput.
External Lookup - Challenges
• Increased latency due to frequent external system calls
• Insufficient memory to hold all reference data in memory
• Scalability & performance issues with large reference data sets
• Reference data needs frequent cache purges and refreshes
• External systems can become a bottleneck
External Lookup – Potential Options
Options compared on Performance, Scalability and Fault Tolerance:
• Always Fetch
• Cache Everything
• Partition and Cache on the go
External Lookup - A Reference Use Case
• Real Time Credit Card Fraud Identification and Alert
o Credit card transaction data arrives as a stream (typically through Kafka)
o An external system holds information about the card holder’s recent location
o Each credit card transaction is looked up against the user’s most recent known location
o If the geographic distance between the transaction location and the user’s recent known location is significant, the transaction is flagged as potential fraud
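The distance check in this use case can be sketched as follows. This is only an illustration: the haversine formula is one common way to compute geographic distance, and the 500 km threshold, the function names and the (latitude, longitude) tuple format are assumptions, not details from the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating point drift before asin
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag the transaction when it is farther than threshold_km (hypothetical value)
    from the user's recent known location."""
    return haversine_km(*txn_location, *user_location) > threshold_km


san_francisco = (37.7749, -122.4194)
new_york = (40.7128, -74.0060)
flagged = is_potential_fraud(new_york, san_francisco)
```

What counts as a "significant" distance depends on the business rules; a real check would likely also consider the time elapsed between the two location readings.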
External Lookup - Topology Overview
• Source Stream → Credit Card Transaction Spout → Partitioner Bolt → Fraud Analyzer Bolt → Alerting System (Fraud Alert Email)
• Partitioner Bolt: partitions data based on area code of the mobile numbers
• Fraud Analyzer Bolt: looks up user’s current location from the external system and finds the geo distance between transaction location and user location. Locally caches the user location data; cache validity is time bound
• External Reference Data: User Location Information
External Lookup - Peek in the Bolts
• Partitioner Bolt Instances 1, 2, ... n: stream is partitioned based on area code
• Fraud Analyzer Bolt Instances: each keeps a time sensitive local cache (use a lightweight caching solution like Guava)
o Instance 1: CA, NV, TX
o Instance 2: NY, CT, MA
o Instance n: FL, NC, OH
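The "partition, then cache on the go" idea can be sketched as below. The slides suggest Guava for the cache in a Java topology; this Python stand-in only illustrates the behaviour. `TimeBoundCache`, `partition_for`, the modulo partitioning rule and the injected `clock` are all hypothetical names and choices made for the example.

```python
import time


class TimeBoundCache:
    """Tiny on-demand cache whose entries expire after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # called on a miss, e.g. the external lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]            # fresh entry: no trip to the external system
        value = self._loader(key)    # missing or expired: fetch and cache
        self._entries[key] = (value, now + self._ttl)
        return value


def partition_for(area_code, num_tasks):
    """Send a given area code to the same analyzer task every time, so each
    task only ever caches its own slice of the reference data."""
    return int(area_code) % num_tasks
```

Because the cache is filled on demand, a restarted task simply starts with an empty cache and rebuilds it as events arrive. This sketch never evicts entries for keys that stop appearing; a production cache would also bound its size.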
External Lookup - Benefits of the approach
• Only required data is cached (on demand)
• Each bolt caches only a partition of the reference data
• Data is locally cached, so trips to the external system are reduced
• Cache is time sensitive
• On-the-go cache building handles failures elegantly
External Lookup – Applicability
• Stream processing depends on external data
• External data is too large to be held in the memory of each task
• External data keeps changing
• External system has scalability limitations
Responsive Shuffling
Responsive Shuffling - Description
Automatically adjust shuffling for better performance and
throughput during peaks and varying data skews in streams
Responsive Shuffling - Challenges
• Incoming data stream is unpredictable and can be skewed
• Skew can change from time to time
• Managing latency and throughput with skews is difficult
• Since streams are continuously flowing, restarting the topology with new shuffling logic is not practical
Shuffling – Potential Options
Options compared on Latency & Throughput, System Reliability and Uptime:
• Static Shuffle
• Responsive Shuffle
Responsive Shuffling - A Reference Use Case
• Optimized HBase Inserts
o Event data is stored in HBase after Storm processing
o Group events such that a bolt can insert more events into HBase with fewer trips to region servers
o Over a period of time, HBase regions can split/merge
o Automatically adjust the event grouping as the HBase region layout changes over time
Example – HBase writes w/o responsive shuffling
• App Bolt Instances 1, 2, 3 (100 events each): 300 events sent
• HBase Bolt Instances 1, 2, 3 (100 events each): 300 events sent
• Region Server Instances 1, 2, 3 (100 events each): 300 events received
• 9 trips to region servers
Responsive Shuffling - Design
Example – HBase writes with responsive shuffling
• App Bolt Instances 1, 2, 3 (100 events each): 300 events sent
• RS Aware Partitioners route events from the App Bolts to the HBase Bolts
• HBase Bolt Instances 1, 2, 3 (100 events each): 300 events sent
• Region Server Instances 1, 2, 3 (100 events each): 300 events received
• 3 trips to region servers
• Partitioner automatically adapts to splitting/merging HBase regions
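The difference between 9 trips and 3 trips can be reproduced with a small simulation. This is a sketch under stated assumptions: regions are described only by their sorted start keys, one "trip" is counted per distinct region touched by a bolt's batch, and all function names and sample keys are hypothetical. In a real topology the region start keys would come from the HBase client and be refreshed as regions split or merge.

```python
from bisect import bisect_right


def region_for(row_key, region_start_keys):
    """Index of the region holding row_key. region_start_keys is sorted and
    begins with '' (the first region has an empty start key)."""
    return bisect_right(region_start_keys, row_key) - 1


def count_trips(batches, region_start_keys):
    """One trip per distinct region touched by each bolt's batch."""
    return sum(len({region_for(k, region_start_keys) for k in batch}) for batch in batches)


def region_aware_batches(row_keys, region_start_keys, num_bolts):
    """Group keys so that all events for a region go to a single bolt."""
    batches = [[] for _ in range(num_bolts)]
    for key in row_keys:
        batches[region_for(key, region_start_keys) % num_bolts].append(key)
    return batches


regions = ["", "h", "p"]  # three regions: ['', 'h'), ['h', 'p'), ['p', ...)
keys = [p + str(i).zfill(3) for p in ("a", "k", "t") for i in range(100)]  # 300 events

static_batches = [keys[i::3] for i in range(3)]        # shuffle that ignores regions
aware_batches = region_aware_batches(keys, regions, 3)

static_trips = count_trips(static_batches, regions)
aware_trips = count_trips(aware_batches, regions)
```

Since `region_for` is evaluated against whatever start keys are current, swapping in a new `regions` list after a split or merge changes the grouping without restarting anything, which is the "responsive" part of the pattern.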
Responsive Shuffling - Benefits
• Topology responds to changes in data patterns and adapts accordingly
• Maintains a high level of SLA and throughput adherence
• Minimizes the need for maintenance and hence downtime
Responsive Shuffling - Applicability
• Change in shuffle pattern does not impact final outcome
• Data stream has varying skews
• Target/Reference system specifications change over time
Out-of-Sequence Events
Out-of-Sequence Events - Description
An out-of-sequence event is one that's received late,
sufficiently late that you've already processed events that
should have been processed after the out-of-sequence
event was received.
Out-of-Sequence Events - Challenges
• Hard to determine if all events in a given window have been received
• Late events need to reference the relevant earlier data
• Builds more pressure on processing components
• Increased latency and degraded overall system performance
Out-of-Sequence Events – Potential Options
Options compared on Latency, Result Accuracy and Operational Ease:
• Drop
• Wait
• Fan Out
Out-of-Sequence Events - Processing
• Source Spout → Event Filter Bolt
• Event Filter Bolt: monitors the events currently being processed and identifies out-of-sequence events
• In sequence events → Typical Processing Bolt
• Out-of-Sequence events → Special Handling Bolt (based on complexities in processing, this can be extended as a different topology)
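The routing decision made by the Event Filter Bolt can be sketched as follows. This is one possible rule, not the one used in the talk: an event counts as out-of-sequence when its timestamp is older than the highest timestamp seen so far minus an allowed lateness. The class name, the stream labels and the `allowed_lateness` parameter are hypothetical.

```python
class EventFilter:
    """Routes each event to 'in_sequence' or 'out_of_sequence' based on the
    highest event time seen so far."""

    def __init__(self, allowed_lateness=0):
        self._allowed_lateness = allowed_lateness
        self._max_seen = None

    def route(self, event_time):
        if self._max_seen is not None and event_time < self._max_seen - self._allowed_lateness:
            return "out_of_sequence"   # goes to the special handling path
        if self._max_seen is None or event_time > self._max_seen:
            self._max_seen = event_time
        return "in_sequence"           # goes to the typical processing path


event_filter = EventFilter(allowed_lateness=5)
routes = [event_filter.route(t) for t in (100, 101, 99, 110, 104, 106)]
```

Keeping the filter this cheap is what lets the typical path hold its latency while the special handling path scales independently.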
Out-of-Sequence Events – Benefits
• Separation of concerns
• Maintains the overall throughput and latency requirements
• Independent scaling of components
Out-of-Sequence Events - Applicability
• Order of events matters
• Processing out-of-sequence events needs special and complex logic
• Stream has a relatively low volume of out-of-sequence events
Summary
• A streaming application is a continuously running process, as opposed to a batch process
• Think long term and plan for data patterns that change over time
• Simplicity gives more reliability and predictability
• Use one or more patterns in conjunction to address the use case
• Patterns are contextual and may not be suitable for every case
Thank You!
sheetal@hortonworks.com
@sheetal_dolas
Appendix
Data Security in Kafka
Data Security in Kafka - Description
Ability to use Kafka as a secure data transfer mechanism.
Apache Kafka is a widely used messaging platform in streaming applications. Unfortunately, Kafka does not have built-in support for Authentication & Authorization (yet).
Data Security in Kafka - Flow
• Source Systems: Syslog
• Data Collection: Custom Collector with an Encrypting Producer
• Messaging System: Kafka, carrying Encrypted Messages
• Real Time Processing: Storm (Kafka Spout → Decrypting Bolt → App Bolt)
Data Security in Kafka – Encryption Details
• Data Collection (Event Producer): Event → encrypt event(s) w/ AES → encrypt AES key w/ RSA → Event(s) Envelope
• Event(s) Envelope: Encrypted AES Key (w/ RSA) + Encrypted Event (w/ AES)
• Messaging System: Kafka Topic carries the Event(s) Envelopes
• Real Time Processing (Storm Decrypting Bolt): Event(s) Envelope → decrypt AES key w/ RSA → decrypt event(s) w/ AES → Event
• RSA public/private keys are generated ahead of time and securely shared with the topology
• The AES key is randomly generated and periodically refreshed
• Only a user having the appropriate RSA private key can read the data
• One event or a batch of events can be encrypted together, as needed
Data Security in Kafka - Applicability
• Multiple applications want to use Kafka as the source of their streams
• Data is sensitive and cannot be shared between applications
• Other components in the pipeline are secured
Micro Batching
Micro Batching - Description
Micro-batching is a technique that allows a process or task
to treat a stream as a sequence of small batches or
chunks of data.
For incoming streams, the events can be packaged into
small batches and delivered to a batch system for
processing
Micro Batching - Challenges
• Data delivery reliability
• Unnecessary data duplication
• Increased latency
• Complexity in time-bound batching
Micro Batching – Potential Options
Options compared on Simplicity, Reusability and Reliability:
• Batch Triggering Thread
• Controller Stream
• Tick Tuples
Tick Tuples
Tick tuples are system generated tuples that Storm can
send to your bolt if you need to perform some actions at a
fixed interval
Tick Tuple based Micro Batching - Benefits
• Takes advantage of target system characteristics by batching events together
• Adheres to processing latency needs by ensuring that batches are executed at regular intervals
• Prevents data loss by acknowledging events only after successful processing
• Simple, elegant and easy to maintain code
Micro Batching - Applicability
• Target systems are more efficient with bulk transactions
• Processing a group of events is more efficient than processing individual events
• End to end event latency is not super sensitive
Micro Batching – Sample Code
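The code from this slide did not survive in the transcript, so what follows is not the author's sample. It is a minimal Python sketch of tick tuple based micro batching, with `TICK` standing in for a Storm tick tuple and `sink` / `ack` as hypothetical callables for the bulk write and the acknowledgement. In Storm itself the bolt would be written against the Java API, with the tick interval typically set through the `topology.tick.tuple.freq.secs` configuration.

```python
TICK = object()  # stand-in for a system generated tick tuple


class MicroBatchingBolt:
    """Buffers events and flushes them on a tick or when the batch is full.
    Events are acknowledged only after the bulk write succeeds."""

    def __init__(self, sink, ack, max_batch_size=100):
        self._sink = sink                      # callable taking a list of events
        self._ack = ack                        # callable taking a single event
        self._max_batch_size = max_batch_size
        self._pending = []

    def execute(self, tup):
        if tup is TICK:
            self._flush()                      # time bound: flush whatever is buffered
            return
        self._pending.append(tup)
        if len(self._pending) >= self._max_batch_size:
            self._flush()                      # size bound: flush a full batch early

    def _flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._sink(batch)                      # if this raises, nothing below is acked
        for event in batch:
            self._ack(event)


written, acked = [], []
bolt = MicroBatchingBolt(written.append, acked.append, max_batch_size=3)
for tup in ("e1", "e2", TICK, "e3", "e4", "e5", TICK):
    bolt.execute(tup)
```

The sketch relies on the assumption that events which are never acknowledged get replayed by the source, which is how the "acknowledge only after successful processing" benefit prevents data loss.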
Page59 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Thank You!
sheetal@hortonworks.com
@sheetal_dolas

More Related Content

What's hot

Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
Guido Schmutz
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
Sudheer Kondla
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
Kent Graziano
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
Guido Schmutz
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Apache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, ConfluentApache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, Confluent
HostedbyConfluent
 
Lambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich
Lambda Architecture in the Cloud with Azure Databricks with Andrei VaranovichLambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich
Lambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich
Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
ChrisFord803185
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 

What's hot (20)

Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Apache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, ConfluentApache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, Confluent
 
Lambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich
Lambda Architecture in the Cloud with Azure Databricks with Andrei VaranovichLambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich
Lambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to Design Patterns For Real Time Streaming Data Analytics

HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
Hortonworks
 
NJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep DiveNJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep Dive
Bryan Bende
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
Milind Pandit
 
Apache Phoenix and HBase - Hadoop Summit Tokyo, Japan
Apache Phoenix and HBase - Hadoop Summit Tokyo, JapanApache Phoenix and HBase - Hadoop Summit Tokyo, Japan
Apache Phoenix and HBase - Hadoop Summit Tokyo, Japan
Ankit Singhal
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Data Con LA
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Hortonworks
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Haimo Liu
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks
 
Joseph Witt
Joseph WittJoseph Witt
Joseph Witt
AFCEA International
 
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the Details
DataWorks Summit/Hadoop Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
DataWorks Summit
 
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Carolyn Duby
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
enissoz
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
DataWorks Summit/Hadoop Summit
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
DataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA
 
State of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & CommunityState of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & Community
Accumulo Summit
 

Similar to Design Patterns For Real Time Streaming Data Analytics (20)

HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
NJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep DiveNJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep Dive
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
Apache Phoenix and HBase - Hadoop Summit Tokyo, Japan
Apache Phoenix and HBase - Hadoop Summit Tokyo, JapanApache Phoenix and HBase - Hadoop Summit Tokyo, Japan
Apache Phoenix and HBase - Hadoop Summit Tokyo, Japan
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Design Patterns For Real Time Streaming Data Analytics

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Design Patterns For Real Time Streaming Data Analytics 15 Apr 2015 Sheetal Dolas Principal Architect, Hortonworks
  • 2. Who am I? • Principal Architect @ Hortonworks • Most of my career has been in the field, solving real-life business problems • Last 5+ years in Big Data, including Hadoop, Storm etc. • Co-developed Cisco OpenSOC (http://opensoc.github.io) sheetal@hortonworks.com @sheetal_dolas
  • 3. Agenda • Streaming Architectural Patterns - Overview • Design Patterns o What o Why o Illustrations • Q&A
  • 4. Streaming Architectural Patterns
  • 5. Real Time Streaming Architecture (diagram). Stages: Source Systems (Syslog, Machine Data, External Streams, Other); Data Collection (Flume / Custom: Agent A, Agent B, Agent N); Messaging System (Kafka: Topic A, Topic B, Topic N); Real Time Processing (Storm: Topology A, Topology B, Topology N); Storage (Search: Elasticsearch / Solr; Low Latency NoSQL: HBase; Historic: Hive / HDFS); Access (Web Services: REST API, Web Apps; Analytic Tools: R / Python, BI Tools; Alerting Systems)
  • 6. Lambda Architecture (diagram). New Data (Data Stream); Batch Layer (All Data, Pre-compute Views); Speed Layer (Stream Processing, Real Time View); Serving Layer (Batch View, Batch View); Data Access (Query)
  • 7. Kappa Architecture (diagram). Data Source (Data Stream); Stream Processing System (Job Version n, Job Version n + 1); Serving DB (Output table n, Output table n + 1); Data Access (Query)
  • 8. Design Patterns
  • 9. Design Pattern – What is it? A general reusable solution to a commonly occurring problem within a given context in software design. (Diagram labels: Reusable Solution, Commonly Occurring Problem, Software Design, Contextual)
  • 10. Design Patterns – Why? • Streaming use cases have distinct characteristics o Unpredictable incoming data patterns o Correlating multiple streams o Out-of-sequence and late events
  • 11. Design Patterns – Why? • High-scale, continuous streams pose new challenges o Peaks and valleys o Data characteristics that change over time o Maintaining the latency and throughput SLAs
  • 12. Streaming Patterns • Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture • Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows • Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events • Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication
  • 13. Streaming Patterns – Being Discussed • Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture • Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows • Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events • Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication
  • 14. External Lookup: Dynamic, High Speed Enrichments With External Data Lookup
  • 15. External Lookup - Description Referencing frequently changing external system data for event enrichments, filters or validations while minimizing event processing latency and system bottlenecks and maintaining high throughput.
  • 16. External Lookup - Challenges • Increased latency due to frequent external system calls • Insufficient memory to hold all reference data • Scalability and performance issues with large reference data sets • Reference data needs frequent cache purges and refreshes • External systems can become a bottleneck
  • 17. External Lookup – Potential Options (comparison chart). Options: Always Fetch; Cache Everything; Partition and Cache on the Go. Criteria: Performance, Scalability, Fault Tolerance
  • 18. External Lookup - A Reference Use Case • Real Time Credit Card Fraud Identification and Alert o Credit card transaction data arrives as a stream (typically through Kafka) o An external system holds information about the card holder’s recent location o Each credit card transaction is looked up against the user’s current location o If the geographic distance between the transaction location and the user’s most recent known location is significant, the transaction is flagged as potential fraud
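The distance check in this use case can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the haversine formula is one common way to compute geographic distance, and the function names and the 500 km threshold are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction made farther than threshold_km (assumed value) from the user's last known location."""
    return haversine_km(*txn_location, *user_location) > threshold_km
```

What counts as a "significant" distance is a business decision; in practice the threshold would likely also account for the time elapsed since the last known location.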
  • 19. External Lookup - Topology Overview (diagram). Source Stream; Credit Card Transaction Spout; Partitioner Bolt (partitions data based on the area code of the mobile numbers); Fraud Analyzer Bolt (looks up the user’s current location from the external system and finds the geo distance between transaction location and user location; locally caches the user location data; cache validity is time bound); Alerting System (Fraud Alert Email). External Reference Data supplies User Location Information.
  • 20. External Lookup - Peek in the Bolts (diagram). Partitioner Bolt Instances 1, 2 ... n; Fraud Analyzer Bolt Instance 1 (CA, NV, TX), Instance 2 (NY, CT, MA), Instance n (FL, NC, OH). The stream is partitioned based on area code. Each analyzer holds a local, time-sensitive cache (use a lightweight caching solution like Guava).
  • 21. External Lookup - Benefits of the approach • Only required data is cached (on demand) • Each bolt caches only its partition of the reference data • Data is cached locally, so trips to the external system are reduced • Cache is time sensitive • Building the cache on the go handles failures elegantly
  • 22. External Lookup – Applicability • Stream processing depends on external data • External data is too large to be held in the memory of each task • External data keeps changing • External system has scalability limitations
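The "partition and cache on the go" idea from the External Lookup slides can be sketched as follows. This is a hedged illustration rather than the deck's code: `TimeBoundCache`, `task_for_area_code` and the `loader` callback are hypothetical names, and the loader stands in for whatever call reaches the external reference system.

```python
import time
import zlib


class TimeBoundCache:
    """On-demand cache whose entries expire ttl_seconds after they were loaded."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader    # called on a miss, e.g. a query to the external system
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be exercised without sleeping
        self._entries = {}       # key -> (value, loaded_at)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]
        value = self._loader(key)  # miss or stale entry: go to the external system
        self._entries[key] = (value, now)
        return value


def task_for_area_code(area_code, num_tasks):
    """Stable mapping from area code to analyzer task index (same code, same task)."""
    return zlib.crc32(area_code.encode("utf-8")) % num_tasks
```

Because each area code always lands on the same task, each task's cache only ever fills with its own slice of the reference data, and a restarted task simply rebuilds its cache from misses. In a Storm topology a fields grouping on the area-code field gives the same routing, and Guava's `CacheBuilder` with `expireAfterWrite` offers comparable time-bound caching on the JVM.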
  • 23. Responsive Shuffling
  • 24. Responsive Shuffling - Description Automatically adjust shuffling for better performance and throughput during peaks and varying data skews in streams
  • 25. Responsive Shuffling - Challenges • Incoming data stream is unpredictable and can be skewed • Skew can change from time to time • Managing latency and throughput with skews is difficult • Since streams flow continuously, restarting the topology with new shuffling logic is not practical
  • 26. Shuffling – Potential Options (comparison chart). Options: Static Shuffle; Responsive Shuffle. Criteria: Latency & Throughput, System Reliability, Uptime
  • 27. Responsive Shuffling - A Reference Use Case • Optimized HBase Inserts o Event data is stored in HBase after Storm processing o Group events such that a bolt can insert more events into HBase with fewer trips to region servers o Over time HBase regions can split/merge o Automatically adjust the event grouping as the HBase region layout changes
  • 28. Example – HBase writes w/o responsive shuffling (diagram). App Bolt Instances 1-3 (100 events each) send 300 events to HBase Bolt Instances 1-3 (100 events each), which send 300 events to Region Server Instances 1-3 (100 events each). 300 events received; 9 trips to region servers.
  • 29. Responsive Shuffling - Design
  • 30. Example – HBase writes with responsive shuffling (diagram). App Bolt Instances 1-3 (100 events each), each with an RS Aware Partitioner, send 300 events to HBase Bolt Instances 1-3 (100 events each), which send 300 events to Region Server Instances 1-3 (100 events each). 300 events received; 3 trips to region servers. The partitioner automatically adapts to splitting/merging HBase regions.
  • 31. Responsive Shuffling - Benefits
      • Topology responds to changes in data patterns and adapts accordingly
      • Maintains a high level of SLA and throughput adherence
      • Minimizes the need for maintenance and hence downtime
  • 32. Responsive Shuffling - Applicability
      • Change in shuffle pattern does not impact the final outcome
      • Data stream has varying skews
      • Target/reference system specifications change over time
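The trip counts in the two HBase examples above can be reproduced with a small simulation. This is a sketch under stated assumptions, not Storm or HBase code: the region layout `REGIONS`, the integer row keys, and the functions `static_shuffle` and `rs_aware_partition` are hypothetical stand-ins for a real grouping and a real region lookup.

```python
# Hypothetical region layout: (start_key inclusive, end_key exclusive, region server).
REGIONS = [(0, 100, "rs1"), (100, 200, "rs2"), (200, 300, "rs3")]


def region_server_for(row_key, regions=REGIONS):
    """Return the region server that owns row_key."""
    for start, end, server in regions:
        if start <= row_key < end:
            return server
    raise KeyError(row_key)


def static_shuffle(row_key, num_bolts):
    """Static shuffle: spread events evenly, ignoring where the row lives."""
    return row_key % num_bolts


def rs_aware_partition(row_key, regions=REGIONS):
    """Responsive shuffle: route an event to the bolt paired with its region server."""
    servers = sorted({server for _, _, server in regions})
    return servers.index(region_server_for(row_key, regions))


def count_trips(row_keys, partition):
    """Count distinct (bolt, region server) pairs, i.e. batched round trips."""
    return len({(partition(key), region_server_for(key)) for key in row_keys})


events = range(300)
static_trips = count_trips(events, lambda key: static_shuffle(key, 3))
aware_trips = count_trips(events, rs_aware_partition)
print(static_trips, aware_trips)  # 9 3
```

To mimic a region split or merge, pass an updated `regions` list to `rs_aware_partition`; the routing follows the new layout without any change to the bolts.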
  • 33. Out-of-Sequence Events
  • 34. Out-of-Sequence Events - Description
      An out-of-sequence event is one that's received late, sufficiently late that you've already processed events that should have been processed after the out-of-sequence event was received.
  • 35. Out-of-Sequence Events - Challenges
      • Hard to determine if all events in a given window have been received
      • Late events require referencing of relevant data
      • Builds more pressure on processing components
      • Increased latency and degraded overall system performance
  • 36. Out-of-Sequence Events – Potential Options
      [Comparison chart: Drop, Wait and Fan Out rated on Latency, Result Accuracy, and Operational Ease]
  • 37. Out-of-Sequence Events - Processing
      [Diagram: Source → Spout → Event Filter Bolt, which monitors the events currently being processed and identifies out-of-sequence events. In-sequence events go to the Typical Processing Bolt; out-of-sequence events go to the Special Handling Bolt. Depending on the complexity of the processing, the special handling can be extended into a separate topology.]
  • 38. Out-of-Sequence Events – Benefits
      • Separation of concerns
      • Maintains the overall throughput and latency requirements
      • Independent scaling of components
  • 39. Out-of-Sequence Events - Applicability
      • When the order of events matters
      • Processing out-of-sequence events needs special and complex logic
      • Stream has a relatively low volume of out-of-sequence events
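The event filter step described above can be sketched as a function that splits arrivals into two streams. This is an illustration only: the function `route_events`, the `(event_time, payload)` tuple shape, and the `allowed_lateness` parameter are assumptions, and a real topology would emit to two bolts rather than build two lists.

```python
def route_events(events, allowed_lateness=0):
    """Split events into in-sequence and out-of-sequence streams.

    events: iterable of (event_time, payload) in arrival order.
    An event is out-of-sequence when its event_time is older than the
    highest event_time seen so far minus allowed_lateness.
    """
    in_sequence, out_of_sequence = [], []
    high_water_mark = None
    for event_time, payload in events:
        is_late = (
            high_water_mark is not None
            and event_time < high_water_mark - allowed_lateness
        )
        if is_late:
            out_of_sequence.append((event_time, payload))  # special handling path
        else:
            in_sequence.append((event_time, payload))      # typical processing path
            if high_water_mark is None or event_time > high_water_mark:
                high_water_mark = event_time
    return in_sequence, out_of_sequence


arrivals = [(1, "a"), (2, "b"), (5, "c"), (3, "late"), (6, "d")]
normal, late = route_events(arrivals)
print(late)  # [(3, 'late')]
```

Raising `allowed_lateness` trades result accuracy for fewer events on the special handling path, which is the same trade-off the Drop, Wait and Fan Out options weigh.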
  • 40. Summary
  • 41. Summary
      • A streaming application is a continuously running process, as opposed to a batch process
      • Think long term and account for data patterns that change over time
      • Simplicity gives more reliability and predictability
      • Use one or more patterns in conjunction to address the use case
      • Patterns are contextual and may not be suitable for every case
  • 42. Thank You! sheetal@hortonworks.com @sheetal_dolas
  • 43. Appendix
  • 44. Data Security in Kafka
  • 45. Data Security in Kafka - Description
      Ability to use Kafka as a secure data transfer mechanism. Apache Kafka is a widely used messaging platform in streaming applications. Unfortunately, Kafka does not have built-in support for authentication and authorization (yet).
  • 46. Data Security in Kafka - Flow
      [Diagram: Source Systems (Syslog sources) → Data Collection (Custom Collector with Encrypting Producer) → Messaging System (Kafka, holding encrypted messages) → Real Time Processing (Storm: Kafka Spout → Decrypting Bolt → App Bolt)]
  • 47. Data Security in Kafka – Encryption Details
      [Diagram: In Data Collection, the Event Producer encrypts the event(s) with AES and encrypts the AES key with RSA, producing an Event(s) Envelope that contains the encrypted AES key (w/ RSA) and the encrypted event (w/ AES). Envelopes are stored in a Kafka topic. In Real Time Processing, the Storm Decrypting Bolt decrypts the AES key with RSA, then decrypts the event(s) with AES.]
  • 48. Data Security in Kafka – Encryption Details
      • RSA public/private keys are generated ahead of time and securely shared with the topology
      • AES key is randomly generated and periodically refreshed
      • Only a user with the appropriate RSA private key can read the data
      • One event or a batch of events can be encrypted together, as needed
  • 49. Data Security in Kafka - Applicability
      • Multiple applications want to use Kafka as the source of their stream
      • Data is sensitive and cannot be shared between applications
      • Other components in the pipeline are secured
  • 50. Micro Batching
  • 51. Micro Batching - Description
      Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data. For incoming streams, the events can be packaged into small batches and delivered to a batch system for processing.
  • 52. Micro Batching - Challenges
      • Data delivery reliability
      • Unnecessary data duplication
      • Increased latency
      • Complexity in time-bound batching
  • 53. Micro Batching – Potential Options
      [Comparison chart: Batch Triggering Thread, Controller Stream and Tick Tuples rated on Simplicity, Reusability, and Reliability]
  • 54. Tick Tuples
      Tick tuples are system-generated tuples that Storm can send to your bolt if you need to perform some action at a fixed interval.
  • 55. Tick Tuple based Micro Batching - Benefits
      • Takes advantage of system characteristics by batching events together
      • Adheres to processing latency needs by ensuring that batches are executed at certain intervals
      • Prevents data loss by acknowledging events only after successful processing
      • Simple, elegant and easy-to-maintain code
  • 56. Micro Batching - Applicability
      • Target systems are more efficient with bulk transactions
      • Processing a group of events is more efficient than processing individual events
      • End-to-end event latency is not highly sensitive
  • 57. Micro Batching – Sample Code
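A minimal sketch of tick-tuple-based micro batching follows; it is not the code from the original slide. It is a plain Python simulation rather than a Storm bolt: the class `MicroBatchingBolt`, the `TICK` marker, the `sink` callable and `max_batch_size` are all hypothetical names. In a real Storm topology the tick interval is set through the `topology.tick.tuple.freq.secs` configuration, and acknowledgement goes through the output collector.

```python
TICK = object()  # stands in for a Storm tick tuple in this simulation


class MicroBatchingBolt:
    """Buffers events and writes them in bulk on a tick or when the batch is full."""

    def __init__(self, sink, max_batch_size=100):
        self.sink = sink                      # callable that bulk-writes a list of events
        self.max_batch_size = max_batch_size
        self.pending = []                     # received but not yet written
        self.acked = []                       # acknowledged after a successful write

    def execute(self, tup):
        if tup is TICK:
            self._flush()                     # time-bound trigger
            return
        self.pending.append(tup)
        if len(self.pending) >= self.max_batch_size:
            self._flush()                     # size-bound trigger

    def _flush(self):
        if not self.pending:
            return
        batch = self.pending
        self.sink(batch)                      # if this raises, nothing is acked
        self.acked.extend(batch)              # ack only after the write succeeded
        self.pending = []


written = []
bolt = MicroBatchingBolt(written.append, max_batch_size=3)
for item in ["e1", "e2", "e3", "e4", TICK]:
    bolt.execute(item)
print(written)  # [['e1', 'e2', 'e3'], ['e4']]
```

The size trigger bounds memory use and the tick trigger bounds latency; because events are acknowledged only after the bulk write returns, a failed write leaves them pending (and, in Storm, eligible for replay).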
  • 58. Thank You! sheetal@hortonworks.com @sheetal_dolas

Editor's Notes

  1. As businesses realize the power of Hadoop and large-scale data analytics, many are demanding large-scale, real-time streaming data analytics. Apache Storm and Apache Spark are platforms that can process large amounts of data in real time. However, building applications on these platforms that scale, reliably process data without any loss, satisfy functional needs and at the same time meet strict latency requirements takes a lot of work to get right. After implementing multiple large real-time data processing applications using these technologies in various business domains, we distilled commonly required solutions into generalized design patterns. These patterns are proven in very large production deployments, where they process millions of events per second, tens of billions of events per day and tens of terabytes of data per day.
  2. All data is dispatched to both the batch layer and the speed layer. The batch layer (i) manages the master dataset and (ii) pre-computes the batch views. The serving layer indexes the batch views for low-latency, ad hoc queries. The speed layer compensates for the high latency of batch processing and deals with recent data only. An incoming query can be answered by merging the batch and real-time views.
  3. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table. When the second job has caught up, switch the application to read from the new table. Stop the old version of the job, and delete the old output table.
  4. Not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem. Patterns are formalized best practices.
  5. Credit card transaction data comes in as a stream (typically through Kafka). An external system has information about the credit card holder’s recent location (collected from GPS on a mobile device and/or from mobile towers). Each credit card transaction is looked up against the user’s current location. If the geographic distance between the credit card transaction location and the user’s most recent known location is significant (say, 100 miles), the credit card transaction is flagged as potential fraud.
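The distance check in this use case can be sketched with the haversine great-circle formula. This is an illustration under assumptions: locations are taken to be `(latitude, longitude)` pairs in degrees, the function names are hypothetical, and the 100-mile threshold is the example figure from the note.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_potential_fraud(txn_location, last_known_location, threshold_miles=100):
    """Flag a transaction made far from the card holder's last known location."""
    distance = haversine_miles(*txn_location, *last_known_location)
    return distance > threshold_miles


san_francisco = (37.7749, -122.4194)
san_jose = (37.3382, -121.8863)
new_york = (40.7128, -74.0060)
print(is_potential_fraud(new_york, san_francisco))  # True
print(is_potential_fraud(san_jose, san_francisco))  # False
```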
  6. Only the required data is cached (on demand), so cache size requirements are reduced. Each bolt caches only its partition of the reference data, so there is no duplicate caching; this further reduces cache size requirements and allows more data to be processed with the same available RAM. Data is cached locally, so trips to the external system are reduced; this reduces latency, increases system throughput and reduces load on the external system. The cache is time sensitive: it can be refreshed after certain intervals for dynamic reference data. Building the cache on the go handles failures elegantly: the cache is rebuilt automatically as events are reprocessed, and no additional handling is needed. If the data patterns are predictable, the cache can also be pre-built when the component starts.
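The on-demand, time-sensitive cache described in this note can be sketched as follows. This is a minimal illustration, not the implementation used in the deployments: the class `TimedLookupCache`, its `loader` callable (standing in for the external lookup), and the injectable `clock` are assumptions made so the example runs on its own.

```python
import time


class TimedLookupCache:
    """Loads reference data on first use and reloads it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader            # callable that queries the external system
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable so the example can control time
        self._entries = {}              # key -> (value, loaded_at)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl_seconds:
            return entry[0]             # fresh hit: no trip to the external system
        value = self.loader(key)        # miss or expired: load on demand
        self._entries[key] = (value, now)
        return value


calls = []


def load_location(card_id):
    calls.append(card_id)               # record each trip to the external system
    return f"location-of-{card_id}"


now = [0.0]
cache = TimedLookupCache(load_location, ttl_seconds=60, clock=lambda: now[0])
cache.get("card-1")
cache.get("card-1")                     # served from the cache
now[0] = 61.0
cache.get("card-1")                     # entry expired, reloaded
print(len(calls))  # 2
```

Because entries are loaded lazily, a restarted bolt starts with an empty cache and refills it as replayed events arrive, which matches the failure handling described above.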
  7. Out-of-sequence events can arrive very late, and processing them requires referencing the relevant data. In streaming applications, it is hard to determine whether all events in a given window have been received. Out-of-sequence events can arrive so late that they put more pressure on processing components, which must wait longer and also do additional processing for very old events. This complexity can increase event-processing latency and degrade overall system performance.
  8. Separation of concerns: separate the processing responsibilities between typical event processing and exceptional event processing. Typical processing components and special handling components can be scaled independently (parallelism, memory needs, latency needs).
  9. When the order of events matters: the input stream may have out-of-sequence events that need to be processed appropriately.