1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi and Stream Processing
Dhruv Kumar
Sr. Solutions Architect
Simplistic View of Enterprise Data Flow
Store Data
Process and Analyze Data
Acquire Data
Dataflow
Realistic View of Enterprise Data Flow
[Diagram labels: seven "?" marks]
Basics of Connecting Systems
For every connection, these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
P1
Producer
C1
Consumer
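As a rough illustration of the agreement list above, the following sketch compares what a producer emits with what a consumer expects. The `Endpoint` class, its three fields, and the sample values are hypothetical and cover only the first three items (protocol, format, schema); the remaining items would need richer modelling.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of what one side of a connection speaks."""
    protocol: str
    data_format: str
    schema: str


def mismatches(producer: Endpoint, consumer: Endpoint) -> list[str]:
    """Return the names of the properties on which the two sides disagree."""
    return [f.name for f in fields(Endpoint)
            if getattr(producer, f.name) != getattr(consumer, f.name)]


# Illustrative values only: P1 sends syslog text, C1 expects JSON over HTTP.
p1 = Endpoint(protocol="syslog", data_format="text", schema="rfc5424")
c1 = Endpoint(protocol="http", data_format="json", schema="rfc5424")
print(mismatches(p1, c1))  # ['protocol', 'data_format']
```

Every name returned is a gap that something between P1 and C1 has to bridge, which is the role the deck assigns to the dataflow layer.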
Apache NiFi: The three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and data plane
Visual Command & Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
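To make the buffering and queuing features above more concrete, here is a minimal sketch of a bounded queue that pushes back on producers when full and releases the highest-priority item first. It is a toy model written for this summary, not NiFi's implementation; the `Connection` class, `offer`, `poll`, and `max_items` are assumed names.

```python
import heapq
import itertools


class Connection:
    """Toy bounded queue: refuses new items when full (backpressure) and
    hands out the most urgent item first (prioritized queuing)."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def offer(self, item, priority: int) -> bool:
        """Try to enqueue; return False to tell the producer to back off."""
        if len(self._heap) >= self.max_items:
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Remove and return the item with the lowest priority number, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


conn = Connection(max_items=2)
accepted = [conn.offer("bulk log", priority=5),
            conn.offer("alert", priority=1),
            conn.offer("late item", priority=0)]
print(accepted)     # [True, True, False]
print(conn.poll())  # alert
```

The third offer is rejected even though it is the most urgent item, because the sketch applies backpressure before looking at priority. A "pressure release" policy would instead discard or age off queued items to make room.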
Matured at NSA 2006-2014
Brief history of the Apache NiFi Community
• 2006: Code developed at NSA
• December 2014: Code available open source, ASL v2
• July 2015: Achieved TLP status in just 7 months
• Today
In 11 months…
• Dev mailing list: 182 subscribers producing ~100 emails/week
• Users mailing list*: 165 subscribers producing ~40 emails/week
*Only 5 months old
• 55 code contributors
• 125 pull requests via GitHub
• 1170 JIRAs filed, 153 new in last two months
• 6 releases, with more in pipeline; targeting a 6-8 week release cycle
• Committers, 13 PMC Members
• Affiliations: Hortonworks, Twitter, Cloudera, US Government, Defense Contractors, etc.
Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
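The terms in the table can be sketched in a few lines of code. This is an illustrative model only, not NiFi's API: `FlowFile` here is a hypothetical dataclass, the processor is a plain function, connections are unbounded `deque` queues for brevity, and `run` stands in for the scheduler.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def uppercase(flowfile: FlowFile) -> FlowFile:
    """A 'black box' processor: transforms the content and tags the result."""
    return FlowFile(flowfile.content.upper(),
                    {**flowfile.attributes, "transformed": "true"})


def run(source: deque, processor, sink: deque) -> None:
    """Stand-in for the scheduler: drain one connection through a processor into the next."""
    while source:
        sink.append(processor(source.popleft()))


inbound, outbound = deque(), deque()  # two connections (queues)
inbound.append(FlowFile(b"hello", {"filename": "a.txt"}))
run(inbound, uppercase, outbound)
print(outbound[0].content)     # b'HELLO'
print(outbound[0].attributes)  # {'filename': 'a.txt', 'transformed': 'true'}
```

Because the processor only sees a FlowFile in and a FlowFile out, processors can be rearranged or grouped without knowing about each other, which is the composition property the Process Group row describes.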
Architecture
Master: NiFi Cluster Manager (NCM)
- OS/Host, JVM: Web Server, NiFi Cluster Manager – Request Replicator
Slaves: NiFi Nodes (three shown, each with the same layout)
- OS/Host, JVM: Web Server, Flow Controller (Processor 1, Extension N)
- Local Storage: FlowFile Repository, Content Repository, Provenance Repository
Apache NiFi’s uses are many…
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Joins / Complex Rolling Window Operations
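The enrichment and preparation steps listed above (format conversion, parsing, routing decisions) are per-record operations, which is why they fit a dataflow tool while joins and windowed computations do not. A minimal sketch, assuming a hypothetical three-column log record and made-up destination names:

```python
import json


def csv_to_json(line: str, columns: list[str]) -> str:
    """Convert one comma-separated record into a JSON object string.
    Naive split: assumes no quoted fields or embedded commas."""
    values = [v.strip() for v in line.split(",")]
    return json.dumps(dict(zip(columns, values)))


def route(record_json: str) -> str:
    """Pick a destination name from the record's 'level' field."""
    level = json.loads(record_json).get("level", "")
    return "alerts" if level == "ERROR" else "archive"


record = csv_to_json("2016-07-09, ERROR, disk full", ["date", "level", "message"])
print(record)         # {"date": "2016-07-09", "level": "ERROR", "message": "disk full"}
print(route(record))  # alerts
```

Each function needs only the record in front of it. Anything that must correlate many records over time belongs in a stream processor.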
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
Collect: Bring Together
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent
• Logs
• Files
• Feeds
• Sensors
Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
• Deliver
• Secure
• Govern
• Audit
Curate: Gain Insights
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights
• Parse
• Filter
• Transform
• Fork
• Clone
HDP + HDF Create Modern Data Apps
DATA AT REST
HDF: DATA IN MOTION
ACTIONABLE INTELLIGENCE
MODERN DATA APPS
• Real-Time Cyber Security: protects systems with superior threat detection
• Smart Manufacturing: dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars: drive themselves and improve road safety
• Future Farming: optimizing soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines: match products to preferences in milliseconds
Streaming Architectures
Drive Data to Core for Analysis
• Drive data from sources to central data center for analysis
• Tiered collection approach at various locations, think regional data centers
[Diagram labels: Edge (MiNiFi), Edge (MiNiFi), Core (NiFi), Stream Processing, Batch Analytics]
Dynamically Adjusting Data Flows
• Push contents back to core NiFi
• Push results back to edge locations/devices to change behavior
[Diagram labels: Edge (MiNiFi), Edge (MiNiFi), Core (NiFi), Batch Analytics, Stream Processing]
Hortonworks DataFlow Reference Architecture
[Diagram, three tiers]
Retail Store: Gateway Server (MiNiFi), Mobile (Client Libraries), Freezer (Client Libraries), Register (MiNiFi), Server Cluster (NiFi)
Regional Center: NiFi, NiFi, Kafka
Core Data Center: Server Cluster (NiFi, NiFi, NiFi), Storm, Kafka, Spark/Flink/etc., DB, Data WH, Others
Cloud: AWS, Azure, Google Cloud
• Tiered processing framework
• Bi-directional communication
• Data prioritization
• Interactive command & control in the center, design & deploy on the edge
Hortonworks DataFlow Reference Architecture
[Diagram: same Retail Store, Regional Center, and Core Data Center tiers]
• Campaign management: coupons/promotions/etc.
• Location based services
Hortonworks DataFlow Reference Architecture
[Diagram: same Retail Store, Regional Center, and Core Data Center tiers]
• Transaction processing
• Fraud detection
Hortonworks DataFlow Reference Architecture
[Diagram: same Retail Store, Regional Center, and Core Data Center tiers]
• Complex processing and cloud computing
• Historical data analytics based on nightly updates
Apache NiFi vs Kafka
NiFi
Good for data traceability and flow management
• Interactive command and control – real time operational visibility
• Data provenance – real time visual chain of custody
• Low scripting maintenance
⚠ Requires adding/removing processors according to consumer-side updates
Kafka
Good for large number of consumers and dynamic consumer-side updates
• Low latency
• Great data durability
• Support large number of producers/consumers
⚠ Not optimized to manage dataflows (prioritization, enrichment, protocols, formats, event level authorizations, objects with various sizes, etc.)
Apache NiFi vs Storm
NiFi
Good for data traceability, flow management, and enrichment
• Data provenance – real time visual chain of custody
• Security – end-to-end secure routing with event level authorization
• Simple event processing
⚠ Scaling model only allows processor-level workload to be distributed evenly across worker nodes
Storm
Good for streaming analytics
• Complex event processing
• Flexible scaling model: workload distribution can be specified on demand at the bolt level
⚠ Not designed to manage data flows
In a nutshell…
NiFi
Hadoop
HDFS
HBase Hive SOLR
YARN
Storm
Service
Management /
Workflow
SIEM
Spark
Raw Network Stream
Network Metadata Stream
Data Stores
Syslog
Raw Application Logs
Other Streaming Telemetry
Key Tenets of Lambda Architecture
• Batch Layer
- Manages master data
- Immutable, append-only set of raw data
- Cleanse, Normalize & Pre-Compute Batch Views
- Advanced Statistical Calculations
• Speed Layer
- Real Time Event Stream Processing
- Computes Real-Time Views
• Serving Layer
- Low-latency, ad-hoc query
- Reporting, BI & Dashboard
[Diagram labels: New Data Stream; BATCH LAYER (Store, Pre-Compute Views); SPEED LAYER (Process Streams, Incremental Views); SERVING LAYER (Business View, Business View, Query)]
HDP and HDF
Fundamental Principles of Streaming Architectures
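A minimal sketch of how the three layers above fit together, assuming a simple event-count use case. The function names and sample events are illustrative; in the deck's architecture these roles would be filled by batch jobs, a stream processor, and a query store rather than in-memory counters.

```python
from collections import Counter


def batch_view(master_data: list[str]) -> Counter:
    """Batch layer: recompute counts from the full, append-only master data."""
    return Counter(master_data)


def speed_view(recent_events: list[str]) -> Counter:
    """Speed layer: counts for events not yet covered by the batch view."""
    return Counter(recent_events)


def query(key: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: answer a query by combining both views."""
    return batch[key] + speed[key]


master = ["login", "purchase", "login"]  # already processed by the batch layer
recent = ["login"]                       # arrived since the last batch run
print(query("login", batch_view(master), speed_view(recent)))  # 3
```

The batch view is always rebuilt from raw data, so a bug in the speed layer is corrected at the next batch run. That is the reason the master data is kept immutable and append-only.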
Detailed Reference Architecture for IoT Applications
[Diagram; only the labels survive]
Source data: Server logs, Application Logs, Firewall Logs, CRM/ERP, Sensor
Stages: High Speed Ingest, Real-Time, Batch, Interactive, Machine Learning
Components: HDF, Kafka, Flume, Sqoop, Storm, Storm/Spark Streaming, HBase, HBase/Phoenix, HDFS, Hive, HiveServer, Pig, Spark, Spark-ML, Spark-Thrift, Silk, JMS
Flow labels: Stream to HDF, Forward to Storm, Sink to HDFS, Bolt to HDFS, Transform, Event Enrichment, Real Time Storage, Iterative ML, Models, Alerts, Dashboard, Reporting, BI Tools, Interactive UI Framework
Demo!

Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flows and IoT apps using Apache NiFi - Dhruv Kumar, Senior Solutions Architect - Hortonworks


Editor's Notes

  • #10 Introduce Flow Based Programming fundamentals, why they matter, and how NiFi adopts them
  • #11 Introduce the architecture of NiFi, describe major system components, and describe the single node and clustering models. For each component describe its available (and potential) deployment models (relate it to Hadoop).
  • #13 HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges - HDF provides 3 key capabilities: the ability to collect data from different types of data sources via a highly secure lightweight agent, the ability to mediate the data flow to/from the data source and the “collector”, and the ability to trace, parse, and transform data in motion to enable analytics and derive insights within an operationally relevant time window.
    - Systems fail: Networks fail, disks fail, software crashes, people make mistakes.
    - Data access exceeds capacity to consume: Sometimes a given data source can outpace some part of the processing or delivery chain; it only takes one weak link to have an issue.
    - Boundary conditions are mere suggestions: You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
    - What is noise one day becomes signal the next: Priorities of an organization change rapidly. Enabling new flows and changing existing ones must be fast.
    - Systems evolve at different rates: The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not at all designed to work together.
    - Compliance and security: Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
    - Continuous improvement occurs in production: It is often not possible to come even close to replicating production environments in the lab.
  • #14 TALK TRACK Here are just a few of the modern data apps that convert yesterday’s impossible challenges into today’s new products, cures, conveniences and life saving innovations. These apps are either custom-built by our customers or they come off the shelf, created by Hortonworks or one of our ecosystem partners to solve a particular problem. Symantec and other cyber security leaders have built powerful apps to detect threats to digital information. Leading pharma, automotive, consumer electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields. And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field or to the cash register to do things that have never before been possible. [NEXT SLIDE]
  • #18
    - Tiered processing framework: it is often not necessary to centralize everything back to the data center. Processing can happen in regional offices as well as on the edge devices, for efficiency (fraud detection logic defined in branch offices, etc.).
    - Bi-directional communication: real-time analytical results can be pushed back to the edge to adjust flow behavior accordingly. Examples: prioritize data collection based on real-time bandwidth (calculated in the DC with Flink jobs); for fraud detection, send triggering events back to the edge to block transactions in real time.
    - Data prioritization: prioritize the data flow. Example: higher priority data can be sent back via LTE, while lower priority data can wait until wifi becomes available.
    - Interactive vs design/deploy: in the data center, flows are complex and managed with interactive command and control, allowing users to fix pipes without shutting down the water. Design the data flow with a visual interface in the DC, and push it to multiple MiNiFi agents with one click (also providing a centralized place to version control flows on all the agents).
  • #24 CapOne – Ingesting from everywhere: Email, Syslog, Applog, Netflow… Moving to a “Cloud Only model”… even looking to use “Docker containers” in Amazon…
  • #25 Roll forward a few years, Hadoop today provides a complete platform to address the batch, serving and speed layers of the Lambda Architecture.
  • #26 The team puts together a detailed architecture of the proposed solution using HDP and HDF. The architecture considers data from numerous sources, including Server Logs, Application Logs, XML and Sensor data. This data is easily accepted into the flexible schema of HDP using HDF and Sqoop. The data is processed using Pig and analyzed using Spark. Then the data is made available in a real-time dashboard as well as to visualization and reporting tools. [NEXT SLIDE]