Apache NiFi - Flow Based Programming Meetup

These are the slides from the July 11th Meetup in Toronto for the Flow Based Programming meetup group at Lighthouse covering Enterprise Dataflow with Apache NiFi.

Published in: Software

  1. Apache NiFi: Enterprise data flow management and FBP
     Joe Witt | July 2017
     © Hortonworks Inc. 2011 – 2017. All Rights Reserved. Apache, NiFi, Apache NiFi, and the NiFi logo are trademarks of the Apache Software Foundation.
  2. About me
     • Member @ Apache Software Foundation
     • Member @ Apache NiFi PMC
     • VP Engineering @ Hortonworks
  3. Agenda
     • The journey to an FBP-like design
     • Architectural elements for Dataflow Management
     • Apache NiFi and FBP
     • Live Demo and discussion
  4. The journey to an FBP-like design
  5. Basics of Connecting Systems
     The data is over here but I want it over there…
     For every connection, these must agree:
       1. Protocol
       2. Format
       3. Schema
       4. Priority
       5. Size of event
       6. Frequency of event
       7. Authorization access
       8. Relevance
     [Diagram labels: P1 Producer, C1 Consumer]
  6. [Slide with no transcribed text]
  7. It started so simple
     • Just needed to scan a directory for new data
     • Send it over the link
     But…
     • Bandwidth was low, latency high, comms unreliable
     • Some data was more useful than others
     • The rules for that could change often
     • Light-weight in-line analysis could be used to determine relative value
     • The value of the data decayed rapidly
     • The data’s raw form was highly inefficient for transport, and large portions of the data could simply be removed in many cases
     • How to document, maintain and fine-tune the configuration?
     • Infrastructure was highly limited
  8. Challenges at the Edge
     • Small footprint
     • Low power
     • Expensive bandwidth
     • High latency
     • Access to data exceeds bandwidth (if you're doing it right)
     • Needs recoverability
     • Needs to be secured for both the data plane and control plane
     [Diagram labels: GATHER, DELIVER, PRIORITIZE; Track from the edge, Through the datacenter]
  9. Simplistic View of Enterprise Data Flow
     [Diagram labels: The Data Flow Thing; Process and Analyze Data; Acquire Data; Store Data]
  10. Realistic View of Enterprise Data Flow
      [Diagram: the only transcribed labels are question marks]
  11. What is Dataflow Management
  12. Dataflow Management
      The systematic process by which data is acquired from all producers and delivered to all consumers
  13. Dataflow Management Considerations
      • Promote Loosely Coupled Systems
      • Types of coupling: Format, Schema, Protocol, Priority, Size, Interest, …
      • Promote Highly Cohesive Systems
      • Producers should focus on production (not the intricacies of consumption)
      • Consumers should focus on storage or processing (not the details of production)
      • Provide Provenance
      • The who/what/when/where/why of data
      • Inter and Intra Process Latency
      • Enable enterprise version control for data
  14. Dataflow Management Considerations
      • Empower Understanding and Interaction
      • Ability to see the flow, safely and quickly iterate and experiment
      • Breaking production is bad – so too is not being able to evolve fast enough
      • Secure
      • Bridge between security domains
      • Data Plane (transport)
      • Control Plane (C&C, Monitoring)
      • Self Service
      • Centralized teams – hard to scale – slow turnaround times
      • Centralized systems – multi-tenant management works
  15. The role of messaging systems
      • Reduce variables: Fix protocol, Data Size, Provide Buffering
      • Historically not very fast or replayable: Apache Kafka solved that
      • Strong solution within a controlled domain
      • But numerous challenges remain
      • Topics do not separate key concerns between producer and consumer pairs such as
        – Authorization
        – Format
        – Schema
        – Interest
        – Prioritization
      • Flow control (back-pressure, pressure-release, filtering, etc.)
  16. Apache NiFi – Built for Dataflow Management
  17. The NSA Years
      • Created in 2006
      • Improved over eight years
      • Simple Initial vision – Visio for real-time dataflow management
      • Key Lessons Learned
      • What scale means – down, up, and out
      • The fearsome force known as Compliance Requirements
      • The power of provenance!
      • Operational best-practices and anti-patterns
      • NSA donated the codebase to the ASF in late 2014
  18. “Maintainability is the real test.” - J Paul Morrison
  19. NiFi Key Features
      • Guaranteed delivery
      • Data buffering
        - Backpressure
        - Pressure release
      • Prioritized queuing
      • Flow specific QoS
        - Latency vs. throughput
        - Loss tolerance
      • Data provenance
      • Recovery/recording a rolling log of fine-grained history
      • Visual command and control
      • Flow templates
      • Pluggable/multi-role security
      • Designed for extension
      • Clustering
  20. NiFi and FBP concept mapping
      FBP Term | NiFi Term | Description
      Information Packet | FlowFile | Each object moving through the system. NiFi does not have the concept of bracket/data IP. Just IP.
      Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems. NiFi does not have the concept of named input ports on black boxes.
      Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
      Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
      Subnet | Process Group | A set of processes and their connections, which can receive and send data via named input/output ports. A process group allows creation of entirely new components simply by composition of its components.
  21. NiFi Architecture
      [Diagram labels: Single Node; Cluster; One or more nodes]
  22. NiFi Architecture – Repositories – Pass by reference
      Excerpt of demo flow… What’s happening inside the repositories…
      BEFORE
        FlowFile: F1 → C1
        Content: C1
        Provenance: P1 → F1 – Create
      AFTER
        FlowFile: F2 → C1; F1 → C1
        Content: C1
        Provenance: P3 → F2 – Clone (F1); P2 → F1 – Route; P1 → F1 – Create
  23. NiFi Architecture – Repositories – Copy on Write
      Excerpt of demo flow… What’s happening inside the repositories…
      BEFORE
        FlowFile: F1 → C1
        Content: C1
        Provenance: P1 → F1 - CREATE
      AFTER
        FlowFile: F1 → C1; F1.1 → C2
        Content: C2 (encrypted); C1 (plaintext) *
        Provenance: P2 → F1.1 - MODIFY; P1 → F1 - CREATE
      * C1 (plaintext) is now eligible to be removed. But if we keep it around as long as possible what cool things can we do?
  24. NiFi Security – at a high level
      Authentication: Authenticate users and systems
      • TLS one-way or mutual auth, Username/Password via LDAP, Kerberos/SPNEGO – out of the box
      Authorization: Provision access to data
      • Pluggable authorization
      • Simple file-based authority provider OR Apache Ranger based provider out of the box
      • Fine-grained rights assignment per action/component for users and groups
      Audit: Maintain a record of data access
      • Detailed logging of all user actions
      • Detailed logging of all REST API interactions (person or non-person)
      • Detailed logging of key system behaviors
      • Data Provenance enables fine-grained end to end tracking
      Data Protection: Protect data at rest and in motion
      • Support a variety of SSL/encryption protocols
      • Tag and utilize tags on data for fine grained access controls
      • Encrypt/decrypt content
      • TDE for Provenance repository (content repository and flowfile WAL work underway!)
  25. Questions
      • Apache NiFi – Join the community!
      • Feature Requests
      • Bug Reports
      • Code Contributions
      • Peer Reviews
      • Documentation
      https://nifi.apache.org
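
The repository behavior on slides 22 and 23 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NiFi's implementation or API: the `Repositories` class, its method names, and the byte reversal standing in for encryption are all hypothetical. It illustrates that a clone only adds a pointer to an existing content claim (pass by reference), while a modification writes a new claim and leaves the old one in place (copy on write).

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Repositories:
    """Toy stand-in for the FlowFile, Content and Provenance repositories.

    Hypothetical illustration only; none of these names come from NiFi.
    """
    flowfiles: dict = field(default_factory=dict)   # FlowFile id -> content claim id
    content: dict = field(default_factory=dict)     # content claim id -> bytes
    provenance: list = field(default_factory=list)  # (processor, FlowFile id, event)
    _claims: count = field(default_factory=lambda: count(1))

    def _write_claim(self, data: bytes) -> str:
        # Content claims are written once and never changed afterwards.
        claim = f"C{next(self._claims)}"
        self.content[claim] = data
        return claim

    def create(self, processor: str, flowfile: str, data: bytes) -> None:
        self.flowfiles[flowfile] = self._write_claim(data)
        self.provenance.append((processor, flowfile, "CREATE"))

    def route(self, processor: str, flowfile: str) -> None:
        # Routing moves the pointer along the flow; no content is touched.
        self.provenance.append((processor, flowfile, "ROUTE"))

    def clone(self, processor: str, parent: str, child: str) -> None:
        # Pass by reference: the child points at the parent's existing claim.
        self.flowfiles[child] = self.flowfiles[parent]
        self.provenance.append((processor, child, "CLONE"))

    def modify(self, processor: str, flowfile: str, new_id: str, transform) -> None:
        # Copy on write: the result goes to a new claim; the old claim stays
        # in the content repository. Assumption of this sketch: the old
        # FlowFile id is retired in favour of new_id.
        old_claim = self.flowfiles.pop(flowfile)
        self.flowfiles[new_id] = self._write_claim(transform(self.content[old_claim]))
        self.provenance.append((processor, new_id, "MODIFY"))


repo = Repositories()
repo.create("P1", "F1", b"plaintext")
repo.route("P2", "F1")
repo.clone("P3", "F1", "F2")  # F2 shares claim C1 with F1
# Byte reversal is only a placeholder for the encryption step on slide 23.
repo.modify("P2", "F1", "F1.1", lambda data: data[::-1])
print(repo.flowfiles)
print(sorted(repo.content))
```

In the sketch, `repo.content` still holds C1 after the MODIFY, matching the slide's note that the plaintext claim is eligible for removal but has not yet been removed.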
