SlideShare a Scribd company logo
1 of 31
Data at Scales &
the Values of
Starting Small
Aldrin Piri - @aldrinpiri
DataWorks Summit 2017 – Munich
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Key: 'Apache NiFi’
Value: 'PMC Member'
Key: 'Work’
Value: ’Sr. Member of Technical Staff @ Hortonworks'
Key: 'Working with NiFi Since’
Value: '2010’
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Apache NiFi: A Primer
Apache MiNiFi
Architecture
Apache NiFi: The Ecosystem
Community
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Byte Scales for Data
SI Prefix
- 10
0
kilo 10
3
mega 10
6
giga 10
9
tera 10
12
peta 10
15
exa 10
18
zetta 10
21
yotta 10
24
“Big Data”
”everything else”
Greek
for
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Problem at Hand
Producers A.K.A Things
Anything
AND
Everything
Internet!
Consumers
• User
• Storage
• System
• …More Things
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use Case: Courier Service
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark /
Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
NiFi NiFi NiFi NiFi NiFi NiFi
Gathering data from disparate sources
NiFi
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi: A Primer
Key Features and Principles
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording
a rolling log of fine-
grained history
• Visual command and
control
• Flow templates
• Pluggable/multi-role
security
• Designed for extension
• Clustering
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi is based on Flow Based Programming (FBP)
FBP Term NiFi Term Description
Information
Packet
FlowFile Each object moving through the system.
Black Box FlowFile
Processor
Performs the work, doing some combination of data routing, transformation,
or mediation between systems.
Bounded
Buffer
Connection The linkage between processors, acting as queues and allowing various
processes to interact at differing rates.
Scheduler Flow
Controller
Maintains the knowledge of how processes are connected, and manages the
threads and allocations thereof which all processes use.
Subnet Process
Group
A set of processes and their connections, which can receive and send data via
ports. A process group allows creation of entirely new component simply by
composition of its components.
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FlowFiles are like HTTP data
HTTP Data FlowFile
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html
Hello world!
Standard FlowFile Attributes
Key: 'entryDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize’ Value: '23609'
FlowFile Attribute Map Content
Key: 'filename’ Value: '15650246997242'
Key: 'path’ Value: './’
Binary Content *
Header
Content
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FlowFiles & Data Agnosticism
 NiFi is data agnostic!
 But, NiFi was designed understanding that users
can care about specifics and provides tooling
to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle
Be conservative in what you do,
be liberal in what you accept from others“
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache MiNiFi
 NiFi lives in the data center. Give it an
enterprise server or a cluster of them.
 MiNiFi lives as close to where data is born
and is a guest on that device or system
“Let me get the key parts of NiFi close to where data begins and provide bidirectional
data transfer"
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache MiNiFi
 Limited computing capability
 Limited power/network
 Restricted software library/platform availability
 No UI
 Physically inaccessible
 Not frequently updated
 Competing standards/protocols
 Scalability
 Privacy & Security
Realities of computing outside the comforts of the data center
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MiNiFi: Precedent from NiFi
 Provides the semantics between two NiFi components across network boundaries
– A custom protocol for inter-NiFi communication
– Secure, Extensible, Load Balanced & Scalable Delivery to Cluster
 Extracted out to a client library which powers integration into popular frameworks like
Apache Spark, Apache Storm, Apache Flink, and Apache Apex
 Attributes and the FlowFile format maintained
A quick look at NiFi Site to Site
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MiNiFi: Precedent from NiFi
 Fine-grained, event level access of interactions with FlowFiles
– CREATE, RECEIVE, FETCH, SEND, DOWNLOAD, DROP, EXPIRE, FORK, JOIN …
 Captures the associated attributes/metadata at the time of the event
 A map of a FlowFile’s journey and how they relate to other FlowFiles in a system
– MiNiFi enables us to get more and further illuminate the map of data processing
A deeper dive into provenance
http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data-provenance
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MiNiFi: Precedent from NiFi
RECEIVE event
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache MiNiFi
 The feedback loop is longer and not
guaranteed
– Removal of Web Server and UI
 Declarative configuration
– Lends itself well to CM processes
– Extensible interface to support varying formats
• Currently provided in YAML
 Reduced set of bundled components
Departures from NiFi in getting the right fit
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache MiNiFi: Scoping
 Go small: Java – Write once, run anywhere*
– Feature parity and reuse of core NiFi libraries
 Go smaller: C++ – Write once**, run anywhere
 Go smallest: Write n-many times, run anywhere
Language libraries to support tagging, FlowFile format, Site to Site protocol, and
provenance generation without a processing framework
– Mobile: Android & iOS
– Language SDKs
Provide all the key principles of NiFi in varying, smaller footprints
WHAT IS THIS!?
A NiFi FOR ANTS!?!
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MiNiFi: Use Case - Connected Car
 Outside vehicle’s
network firewall
 On telematics layer
VEHICLE NETWORK FIREWALL
TRANSMIT
EXECUTE FILTER
PRIORITIZE
PARSE
LISTEN
ROUTE
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Connecting the Drops
SOURCES
REGIONAL
INFRASTRUCTURE
CORE
INFRASTRUCTURE
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use Case: Courier Service with Apache NiFi & MiNiFi
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark /
Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
Client
Libraries
Client
Libraries
MiNiFi
MiNiFi
NiFi NiFi NiFi NiFi NiFi NiFi
Client
Libraries
Gathering data from disparate sources
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved© Hortonworks Inc. 2011 – 2017. All Rights ReservedX
Data Provenance
▪ Constrained
▪ High-latency
▪ Localized context
▪ Hybrid – cloud/on-premises
▪ Low-latency
▪ Global context
Origin – attribution
Replay – recovery
Evolution of topologies
Long retention
Types of Lineage
• Event
• Configuration
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi: The Ecosystem
 Site-to-Site in MiNiFi instances provides machine-to-machine (M2M) communication
– Data arrives to NiFi in a transparent manner allowing integration to existing flows
 Similar attention to extensibility in both Java and C++ clients allows agents to fit the
needs of your organization
 Reduced footprint allows NiFi functionality to aid in production of high fidelity data,
more closely attributable and tracked from where it is generated
We’ve provided a framework to extend the reach of data ingest
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi: The Ecosystem
 Enter the MiNiFi Command & Control
– Provide tooling to map the UX of
interactive command and control in NiFi
to the design and deploy approach of
MiNiFi
But more instances complicate my operational management!
https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Command+and+Control
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi: The Ecosystem
 Configuration Management of Flows & Versioning
– The evolution of templates to better support SDLC functions
– https://cwiki.apache.org/confluence/display/NIFI/Configuration+Management+of+Flows
 Extension Repositories
– Publish & Share extension bundles (NARs)
– https://cwiki.apache.org/confluence/display/NIFI/Extension+Repositories+%28aka+Extension+Regis
try%29+for+Dynamically-loaded+Extensions
 Variable Registry
– Initial framework support & file-based implementation
– https://cwiki.apache.org/confluence/display/NIFI/Variable+Registry
Building on efforts for reusable components in the community
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Apache NiFi & MiNiFi?
 Moving data is multifaceted in its challenges and these are present in different contexts
at varying scopes
– Think of our courier example and organizations like it: inter vs intra, domestically, internationally
 Provide common tooling and extensions that are commonly needed but be flexible for
extension
– Leverage existing libraries and expansive Java ecosystem for functionality
– Allow organizations to integrate with their existing infrastructure
 Empower folks managing your infrastructure to make changes and reason about issues
that are occurring
– Data Provenance to show context and data’s journey
– User Interface/Experience a key component
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn more and join us!
Apache NiFi site
https://nifi.apache.org
Subproject MiNiFi site
https://nifi.apache.org/minifi/
Subscribe to and collaborate at
dev@nifi.apache.org
users@nifi.apache.org
Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
https://issues.apache.org/jira/browse/MINIFI
Follow us on Twitter
@apachenifi
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Crash Course
Thursday, 6 April
11:15 AM – 1:45PM, Room 12
• Learn more about NiFi, the community, and work through a hands-on lab
• Seats available on a first come, first served basis
• Make sure you are in possession of the latest version of VirtualBox
• More details: https://tinyurl.com/nifi-cc-munich17
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn, Share at Birds of a Feather
IOT, STREAMING & DATA FLOW
Thursday, April 6
5:50 pm, Room 5
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You

More Related Content

More from DataWorks Summit/Hadoop Summit

Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...DataWorks Summit/Hadoop Summit
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics
 

Recently uploaded

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Data at Scales & The Values of Starting Small

  • 1. Data at Scales & the Values of Starting Small Aldrin Piri - @aldrinpiri DataWorks Summit 2017 – Munich
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Key: 'Apache NiFi’ Value: 'PMC Member' Key: 'Work’ Value: ’Sr. Member of Technical Staff @ Hortonworks' Key: 'Working with NiFi Since’ Value: '2010’
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Apache NiFi: A Primer Apache MiNiFi Architecture Apache NiFi: The Ecosystem Community
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Byte Scales for Data SI Prefix - 10 0 kilo 10 3 mega 10 6 giga 10 9 tera 10 12 peta 10 15 exa 10 18 zetta 10 21 yotta 10 24 “Big Data” ”everything else” Greek for
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The Problem at Hand Producers A.K.A Things Anything AND Everything Internet! Consumers • User • Storage • System • …More Things
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Use Case: Courier Service Physical Store Gateway Server Mobile Devices Registers Server Cluster Distribution Center Kafka Core Data Center at HQ Server Cluster Others Storm / Spark / Flink / Apex Kafka Storm / Spark / Flink / Apex On Delivery Routes Trucks Deliverers Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/ Deliverer: Rigo Peter, https://thenounproject.com/rigo/ Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/ Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/ NiFi NiFi NiFi NiFi NiFi NiFi Gathering data from disparate sources NiFi
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi: A Primer Key Features and Principles • Guaranteed delivery • Data buffering - Backpressure - Pressure release • Prioritized queuing • Flow specific QoS - Latency vs. throughput - Loss tolerance • Data provenance • Recovery/recording a rolling log of fine- grained history • Visual command and control • Flow templates • Pluggable/multi-role security • Designed for extension • Clustering
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NiFi is based on Flow Based Programming (FBP) FBP Term NiFi Term Description Information Packet FlowFile Each object moving through the system. Black Box FlowFile Processor Performs the work, doing some combination of data routing, transformation, or mediation between systems. Bounded Buffer Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates. Scheduler Flow Controller Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use. Subnet Process Group A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved FlowFiles are like HTTP data HTTP Data FlowFile HTTP/1.1 200 OK Date: Sun, 10 Oct 2010 23:26:07 GMT Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT ETag: "45b6-834-49130cc1182c0" Accept-Ranges: bytes Content-Length: 13 Connection: close Content-Type: text/html Hello world! Standard FlowFile Attributes Key: 'entryDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016' Key: 'lineageStartDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016' Key: 'fileSize’ Value: '23609' FlowFile Attribute Map Content Key: 'filename’ Value: '15650246997242' Key: 'path’ Value: './’ Binary Content * Header Content
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved FlowFiles & Data Agnosticism  NiFi is data agnostic!  But, NiFi was designed understanding that users can care about specifics and provides tooling to interact with specific formats, protocols, etc. ISO 8601 - http://xkcd.com/1179/ Robustness principle Be conservative in what you do, be liberal in what you accept from others“
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache MiNiFi  NiFi lives in the data center. Give it an enterprise server or a cluster of them.  MiNiFi lives as close to where data is born and is a guest on that device or system “Let me get the key parts of NiFi close to where data begins and provide bidirectional data transfer"
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache MiNiFi  Limited computing capability  Limited power/network  Restricted software library/platform availability  No UI  Physically inaccessible  Not frequently updated  Competing standards/protocols  Scalability  Privacy & Security Realities of computing outside the comforts of the data center
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved MiNiFi: Precedent from NiFi  Provides the semantics between two NiFi components across network boundaries – A custom protocol for inter-NiFi communication – Secure, Extensible, Load Balanced & Scalable Delivery to Cluster  Extracted out to a client library which powers integration into popular frameworks like Apache Spark, Apache Storm, Apache Flink, and Apache Apex  Attributes and the FlowFile format maintained A quick look at NiFi Site to Site https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved MiNiFi: Precedent from NiFi  Fine-grained, event level access of interactions with FlowFiles – CREATE, RECEIVE, FETCH, SEND, DOWNLOAD, DROP, EXPIRE, FORK, JOIN …  Captures the associated attributes/metadata at the time of the event  A map of a FlowFile’s journey and how they relate to other FlowFiles in a system – MiNiFi enables us to get more and further illuminate the map of data processing A deeper dive into provenance http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data-provenance
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved MiNiFi: Precedent from NiFi RECEIVE event
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache MiNiFi  The feedback loop is longer and not guaranteed – Removal of Web Server and UI  Declarative configuration – Lends itself well to CM processes – Extensible interface to support varying formats • Currently provided in YAML  Reduced set of bundled components Departures from NiFi in getting the right fit
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache MiNiFi: Scoping  Go small: Java – Write once, run anywhere* – Feature parity and reuse of core NiFi libraries  Go smaller: C++ – Write once**, run anywhere  Go smallest: Write n-many times, run anywhere Language libraries to support tagging, FlowFile format, Site to Site protocol, and provenance generation without a processing framework – Mobile: Android & iOS – Language SDKs Provide all the key principles of NiFi in varying, smaller footprints
  • 19. WHAT IS THIS!? A NiFi FOR ANTS!?!
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved MiNiFi: Use Case - Connected Car  Outside vehicle’s network firewall  On telematics layer VEHICLE NETWORK FIREWALL TRANSMIT EXECUTE FILTER PRIORITIZE PARSE LISTEN ROUTE
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Connecting the Drops SOURCES REGIONAL INFRASTRUCTURE CORE INFRASTRUCTURE
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Use Case: Courier Service with Apache NiFi & MiNiFi Physical Store Gateway Server Mobile Devices Registers Server Cluster Distribution Center Kafka Core Data Center at HQ Server Cluster Others Storm / Spark / Flink / Apex Kafka Storm / Spark / Flink / Apex On Delivery Routes Trucks Deliverers Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/ Deliverer: Rigo Peter, https://thenounproject.com/rigo/ Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/ Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/ Client Libraries Client Libraries MiNiFi MiNiFi NiFi NiFi NiFi NiFi NiFi NiFi Client Libraries Gathering data from disparate sources
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved© Hortonworks Inc. 2011 – 2017. All Rights ReservedX Data Provenance ▪ Constrained ▪ High-latency ▪ Localized context ▪ Hybrid – cloud/on-premises ▪ Low-latency ▪ Global context Origin – attribution Replay – recovery Evolution of topologies Long retention Types of Lineage • Event • Configuration
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi: The Ecosystem  Site-to-Site in MiNiFi instances provides machine-to-machine (M2M) communication – Data arrives to NiFi in a transparent manner allowing integration to existing flows  Similar attention to extensibility in both Java and C++ clients allows agents to fit the needs of your organization  Reduced footprint allows NiFi functionality to aid in production of high fidelity data, more closely attributable and tracked from where it is generated We’ve provided a framework to extend the reach of data ingest
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi: The Ecosystem  Enter the MiNiFi Command & Control – Provide tooling to map the UX of interactive command and control in NiFi to the design and deploy approach of MiNiFi But more instances complicate my operational management! https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Command+and+Control
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi: The Ecosystem  Configuration Management of Flows & Versioning – The evolution of templates to better support SDLC functions – https://cwiki.apache.org/confluence/display/NIFI/Configuration+Management+of+Flows  Extension Repositories – Publish & Share extension bundles (NARs) – https://cwiki.apache.org/confluence/display/NIFI/Extension+Repositories+%28aka+Extension+Regis try%29+for+Dynamically-loaded+Extensions  Variable Registry – Initial framework support & file-based implementation – https://cwiki.apache.org/confluence/display/NIFI/Variable+Registry Building on efforts for reusable components in the community
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Why Apache NiFi & MiNiFi?  Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes – Think of our courier example and organizations like it: inter vs intra, domestically, internationally  Provide common tooling and extensions that are commonly needed but be flexible for extension – Leverage existing libraries and expansive Java ecosystem for functionality – Allow organizations to integrate with their existing infrastructure  Empower folks managing your infrastructure to make changes and reason about issues that are occurring – Data Provenance to show context and data’s journey – User Interface/Experience a key component
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Learn more and join us! Apache NiFi site https://nifi.apache.org Subproject MiNiFi site https://nifi.apache.org/minifi/ Subscribe to and collaborate at dev@nifi.apache.org users@nifi.apache.org Submit Ideas or Issues https://issues.apache.org/jira/browse/NIFI https://issues.apache.org/jira/browse/MINIFI Follow us on Twitter @apachenifi
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi Crash Course Thursday, 6 April 11:15 AM – 1:45PM, Room 12 • Learn more about NiFi, the community, and work through a hands-on lab • Seats available on a first come, first served basis • Make sure you are in possession of the latest version of VirtualBox • More details: https://tinyurl.com/nifi-cc-munich17
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Learn, Share at Birds of a Feather IOT, STREAMING & DATA FLOW Thursday, April 6 5:50 pm, Room 5
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank You