
Hadoop Summit Tokyo Apache NiFi Crash Course

Hadoop Summit Tokyo Apache NiFi

Published in: Technology

Hadoop Summit Tokyo Apache NiFi Crash Course

  1. Apache NiFi Crash Course Intro. Rafael Coss - @racoss. Hadoop Summit – Tokyo, Oct 2016
  2. Agenda: Data Flow & Streaming Fundamentals; What is dataflow and what are the challenges?; Apache NiFi; Architecture; Lab. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  3. Data Flow & Streaming Fundamentals
  4. Connected Data World (44ZB in 2020)
      • Internet of Anything (IoAT)
        – Wind Turbines, Oil Rigs, Cars
        – Weather Stations, Smart Grids
        – RFID Tags, Beacons, Wearables
      • User Generated Content (Web & Mobile)
        – Twitter, Facebook, Snapchat, YouTube
        – Clickstream, Ads, User Engagement
        – Payments: PayPal, Venmo
  5. Let’s Connect A to B. Producers (a.k.a. Things: Anything AND Everything); Internet!; Consumers: User, Storage, System, …More Things
  6. What is Stream Processing?
      Batch Processing (historical, past)
      • Ability to process and analyze data at rest (stored data)
      • Request-based, bulk evaluation and short-lived processing
      • Enabler for Retrospective, Reactive and On-demand Analytics
      Stream Processing (real-time, now)
      • Ability to ingest, process and analyze data in motion in real or near-real time
      • Event or micro-batch driven, continuous evaluation and long-lived processing
      • Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
      Stream Processing + Batch Processing = All Data Analytics
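The batch versus stream distinction on this slide can be illustrated with a small sketch. It is not taken from the deck: the function names and the sample readings are assumptions chosen for illustration. The batch function evaluates stored data in one bulk request, while the streaming function keeps running and emits an updated result for every event it sees.

```python
from typing import Iterable, Iterator


def batch_average(readings: list[float]) -> float:
    """Batch: the data is at rest, so evaluate it in bulk, once, on request."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream: the data is in motion, so emit an updated average per event."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings, used only for this illustration.
stored = [10.0, 20.0, 30.0]
batch_result = batch_average(stored)
stream_results = list(streaming_average(iter(stored)))
```

Once the stream has seen every event, its latest value matches the batch answer; the difference is that intermediate answers were available while the data was still arriving.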
  7. Modern Data Applications: Custom or Off the Shelf
      • Real-Time Cyber Security protects systems with superior threat detection
      • Smart Manufacturing dramatically improves yields by managing more variables in greater detail
      • Connected, Autonomous Cars drive themselves and improve road safety
      • Future Farming optimizes soil, seeds and equipment to measured conditions on each square foot
      • Automatic Recommendation Engines match products to preferences in milliseconds
      Diagram labels: Data at Rest, Data in Motion, Actionable Intelligence, Modern Data Applications, Hortonworks DataFlow, Hortonworks Data Platform
  8. Simplistic View of Dataflows: Easy, Definitive. Diagram labels: Acquire Data, Store Data, Process and Analyze Data, Dataflow
  9. The Unassuming Line: A Case Study. We’ve seen a few lines show up in the wild thus far: Internet!; inter- and intra-connections in our global courier enterprise. Spotlight: Arthur Lacôte, https://thenounproject.com/turo/
  10. Dataflow Line Anatomy 101. Let’s dissect what this line typically represents. Fig 1. Lineus Worldwidewebus. Common Name: Internet! Diagram labels: Script or Application, Data, Disparate Transport Mechanisms
  11. Dataflow Line Anatomy 201. Sometimes that transport is just more lines. Fig 1. Lineus Worldwidewebus. Common Name: Internet! Diagram labels: Script or Application, Data, Line Inception
  12. Realistic View of Dataflows: Complex, Convoluted. Diagram labels: Acquire Data (×4), Store Data (×5), Process and Analyze Data, Dataflow
  13. Streaming Architecture. Diagram labels: Ingestion, Simple Event Processing, Engine, Stream Processing, Destination, Data Bus
  14. High-Level Overview. Diagram labels: IoT Devices, IoT Edge (single node), NiFi Hub, Data Broker, Column DB, Data Store, Live Dashboard, Data Center (on premises/cloud), HDFS/S3, HBase/Cassandra
  15. Agenda: What is dataflow and what are the challenges?; Apache NiFi; Architecture; Live Demo; Community
  16. Moving data effectively is hard. Standards: http://xkcd.com/927/
  17. Why is moving data effectively hard?
      • Standards
      • Formats
      • “Exactly Once” Delivery
      • Protocols
      • Veracity of Information
      • Validity of Information
      • Ensuring Security
      • Overcoming Security
      • Compliance
      • Schemas
      • Consumers Change
      • Credential Management
      • “That [person|team|group]”
      • Network
      • “Exactly Once” Delivery
  18. Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs. Let’s consider the needs of a courier service.
      Diagram labels: Physical Store, Gateway Server, Mobile Devices, Registers, Server Cluster, Distribution Center, Core Data Center at HQ, Server Cluster, On Delivery Routes, Trucks, Deliverers
      Icon credits: Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/; Deliverer: Rigo Peter, https://thenounproject.com/rigo/; Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/; Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
  19. Great! I am collecting all this data! Let’s use it! Finding our needles in the haystack.
      Diagram labels: Physical Store, Gateway Server, Mobile Devices, Registers, Server Cluster, Distribution Center, Kafka, Core Data Center at HQ, Server Cluster, Others, Storm / Spark / Flink / Apex, Kafka, Storm / Spark / Flink / Apex, On Delivery Routes, Trucks, Deliverers
      Icon credits: Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/; Deliverer: Rigo Peter, https://thenounproject.com/rigo/; Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/; Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
  20. Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs. Oh, that courier service is global.
  21. Agenda: What is dataflow and what are the challenges?; Apache NiFi; Architecture; Live Demo; Community
  23. Use Cases: Capabilities / Gaps. Use cases collected from the field since last release (HDF 1.2):
      • Major business drivers behind the use case
      • Problems, challenges and major pain points
      • How does NiFi help solve the problems
      • What are the remaining gaps
  24. Apache NiFi High Level Capabilities
      • Web-based user interface
        – Design, control, feedback & monitoring
      • Highly configurable
        – Loss tolerant vs guaranteed delivery
        – Low latency vs high throughput
        – Dynamic prioritization
        – Flow can be modified at runtime
        – Back pressure
      • Data provenance
        – Track dataflow from beginning to end
      • Designed for extension
        – Build your own processors
      • Secure
        – SSL, SSH, HTTPS, etc.
  25. Apache NiFi Key Features
      • Guaranteed delivery
      • Data buffering
        – Backpressure
        – Pressure release
      • Prioritized queuing
      • Flow specific QoS
        – Latency vs. throughput
        – Loss tolerance
      • Data provenance
      • Supports push and pull models
      • Recovery/recording a rolling log of fine-grained history
      • Visual command and control
      • Flow templates
      • Pluggable/multi-role security
      • Designed for extension
      • Clustering
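To make "data buffering with backpressure" and "prioritized queuing" concrete, here is a minimal sketch of a bounded, prioritized queue. It is an illustration only, not NiFi's implementation: the `Connection` class, the `backpressure_threshold` parameter and the "lower number means higher priority" rule are all assumptions made for this example.

```python
import heapq
import itertools


class Connection:
    """Toy bounded queue: refuses new items when full, releases by priority."""

    def __init__(self, backpressure_threshold: int):
        # Hypothetical setting: maximum number of queued items before
        # the upstream producer is asked to stop sending.
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def is_full(self) -> bool:
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority: int = 0) -> bool:
        """Try to enqueue; False signals back pressure to the producer."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), item))
        return True

    def poll(self):
        """Dequeue the highest-priority item (lowest number), or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A producer that receives `False` from `offer` would pause or slow down rather than drop data, which is the essence of back pressure; whether to wait or to discard (loss tolerance) is the kind of per-flow trade-off the slide calls flow-specific QoS.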
  26. Deeper Ecosystem Integration: 170+ Processors. HTTP, Syslog, Email, HTML, Image, Hash, Encrypt, Extract, Tail, Merge, Evaluate, Duplicate, Execute, Scan, GeoEnrich, Replace, Convert, Split, Translate, HL7, FTP, UDP, XML, SFTP, Route Content, Route Context, Route Text, Control Rate, Distribute Load, AMQP
  27. Revisit: Courier service from the perspective of NiFi.
      Diagram labels: Physical Store, Gateway Server, Mobile Devices, Registers, Server Cluster, Distribution Center, Core Data Center at HQ, Server Cluster, Trucks, Deliverers, On Delivery Routes, NiFi (×6)
      Icon credits: Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/; Deliverer: Rigo Peter, https://thenounproject.com/rigo/; Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/; Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
  28. Courier service from the perspective of NiFi & MiNiFi.
      Diagram labels: Physical Store, Gateway Server, Mobile Devices, Registers, Server Cluster, Distribution Center, Core Data Center at HQ, Server Cluster, Trucks, Deliverers, On Delivery Routes, Client Libraries (×3), MiNiFi (×2), NiFi (×6)
      Icon credits: Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/; Deliverer: Rigo Peter, https://thenounproject.com/rigo/; Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/; Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
  29. Apache NiFi Subproject: MiNiFi
      • Let me get the key parts of NiFi close to where data begins and provide bi-directional communication
      • NiFi lives in the data center. Give it an enterprise server or a cluster of them.
      • MiNiFi lives as close as possible to where data is born and is a guest on that device or system
  30. Visual Command and Control vs. Design and Deploy
  31. Apache NiFi Managed Dataflow. Diagram labels: Sources, Regional Infrastructure, Core Infrastructure
  32. NiFi is based on Flow Based Programming (FBP)
      FBP Term | NiFi Term | Description
      Information Packet | FlowFile | Each object moving through the system.
      Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
      Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
      Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
      Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
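The FBP vocabulary in the table above can be sketched in a few lines. This is a toy model under stated assumptions, not NiFi's API: `FlowFile`, `FlowController`, `uppercase` and `tag_size` are hypothetical names, processors are plain functions, connections are in-memory queues, and the controller runs on a single thread.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FlowFile:
    """Information Packet: content plus a map of attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


# Black Box: a processor takes a FlowFile and returns a new one (or drops it).
Processor = Callable[[FlowFile], Optional[FlowFile]]


def uppercase(ff: FlowFile) -> FlowFile:
    return FlowFile(ff.content.upper(), dict(ff.attributes))


def tag_size(ff: FlowFile) -> FlowFile:
    return FlowFile(ff.content, dict(ff.attributes, size=str(len(ff.content))))


class FlowController:
    """Scheduler: knows how processors are wired and moves data between them."""

    def __init__(self, processors: list[Processor]):
        self.processors = processors
        # Bounded Buffer stand-in: one queue feeding each processor,
        # plus a final queue that collects the output.
        self.connections = [deque() for _ in range(len(processors) + 1)]

    def submit(self, ff: FlowFile) -> None:
        self.connections[0].append(ff)

    def run_once(self) -> None:
        for i, processor in enumerate(self.processors):
            inbound, outbound = self.connections[i], self.connections[i + 1]
            while inbound:
                result = processor(inbound.popleft())
                if result is not None:
                    outbound.append(result)

    @property
    def output(self) -> deque:
        return self.connections[-1]
```

Wrapping a `FlowController` and its processors behind a single `submit`/`output` pair is roughly what the table's Process Group row describes: a new component built by composing existing ones.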
  33. FlowFiles & Data Agnosticism
      • NiFi is data agnostic!
      • But, NiFi was designed understanding that users can care about specifics and provides tooling to interact with specific formats, protocols, etc.
      ISO 8601 - http://xkcd.com/1179/
      Robustness principle: “Be conservative in what you do, be liberal in what you accept from others”
  34. FlowFiles are like HTTP data
      HTTP Data
        Header:
          HTTP/1.1 200 OK
          Date: Sun, 10 Oct 2010 23:26:07 GMT
          Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
          Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
          ETag: "45b6-834-49130cc1182c0"
          Accept-Ranges: bytes
          Content-Length: 13
          Connection: close
          Content-Type: text/html
        Content:
          Hello world!
      FlowFile
        Header:
          Standard FlowFile Attributes
            Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
            Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
            Key: 'fileSize' Value: '23609'
          FlowFile Attribute Map Content
            Key: 'filename' Value: '15650246997242'
            Key: 'path' Value: './'
        Content:
          Binary Content *
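The HTTP analogy on this slide can be sketched as a function that splits a raw HTTP response into a header map and an opaque body, which is the same shape as a FlowFile's attribute map and binary content. The function name `http_to_flowfile` and the attribute key `http.status.line` are assumptions for illustration; they are not NiFi attribute names.

```python
def http_to_flowfile(raw: bytes) -> tuple[dict[str, str], bytes]:
    """Split an HTTP message into (attributes, content), FlowFile style."""
    # Headers and body are separated by the first blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Hypothetical key for the status line; the headers follow as key/value pairs.
    attributes = {"http.status.line": lines[0]}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        attributes[name.strip().lower()] = value.strip()

    # The body stays as raw bytes: the container does not interpret it.
    return attributes, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Sun, 10 Oct 2010 23:26:07 GMT\r\n"
    b"Content-Length: 13\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"Hello world!\n"
)
attributes, content = http_to_flowfile(raw_response)
```

Keeping the content as uninterpreted bytes is what lets the container stay data agnostic, while the attributes give routing and filtering logic something structured to work with.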
  35. The need for data provenance (lineage from BEGIN to END)
      For Operators
      • Traceability, lineage
      • Recovery and replay
      For Compliance
      • Audit trail
      • Remediation
      For Business / Mission
      • Value sources
      • Value IT investment
  36. Data Provenance: Improved Navigation and Clearer Interaction
      • Tracks data at each point as it flows through the system
      • Records, indexes, and makes events available for display
      • Handles fan-in/fan-out, i.e. merging and splitting data
      • View attributes and content at given points in time
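A rough sketch of how provenance tracking with fan-in and fan-out might be modeled follows. It is a simplified illustration, not NiFi's provenance repository: `ProvenanceLog`, its methods, the string identifiers and the event labels used here are assumptions for this example. Each event records which item it concerns and which parent items it was derived from, so lineage can be walked backwards through splits and merges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str          # illustrative label, e.g. "RECEIVE", "FORK", "JOIN"
    flowfile_id: str
    parent_ids: tuple = ()


class ProvenanceLog:
    """Append-only event log plus an index from each item to its parents."""

    def __init__(self):
        self.events = []
        self._parents = {}

    def record(self, event_type: str, flowfile_id: str, parent_ids=()) -> None:
        self.events.append(
            ProvenanceEvent(event_type, flowfile_id, tuple(parent_ids))
        )
        self._parents.setdefault(flowfile_id, set()).update(parent_ids)

    def lineage(self, flowfile_id: str) -> set:
        """Return the ids of every ancestor of the given item."""
        seen = set()
        stack = [flowfile_id]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


log = ProvenanceLog()
log.record("RECEIVE", "a")
log.record("FORK", "a1", parent_ids=["a"])        # fan-out: "a" split in two
log.record("FORK", "a2", parent_ids=["a"])
log.record("RECEIVE", "b")
log.record("JOIN", "m", parent_ids=["a1", "b"])   # fan-in: merge of "a1" and "b"
```

Asking for `log.lineage("m")` walks back through the merge and the split to the original inputs, which is the traceability the operator and compliance bullets on the previous slide rely on.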
  37. Agenda: What is dataflow and what are the challenges?; Apache NiFi; Architecture; Live Demo; Community
  38. Zero-master Clustering Framework
  39. NiFi vs MiNiFi Java Processes
      • NiFi: NiFi Framework, User Interface, Components
      • MiNiFi: NiFi Framework, Components
  40. MiNiFi
      MiNiFi Agent
      • Java agent
        – Java implementation
        – Availability: GA in HDF 2.0 (built from scratch, ~10 MB)
      • Native agent
        – C++ implementation
        – Availability: TP in HDF 2.0, GA post HDF 2.0
        – Resource efficient (focus on memory and disk)
      MiNiFi Management
      • Near term (HDF 2.0): design & deploy
        – Push updates
        – Config file driven / REST API access (MiNiFi API: post configurations, receive information, etc.)
      • Long term
        – Centralized command and control
  41. Why NiFi?
      • Moving data is multifaceted in its challenges, and these are present in different contexts at varying scopes
        – Think of our courier example and organizations like it: inter vs intra, domestically, internationally
      • Provide common tooling and extensions that are commonly needed, but be flexible for extension
        – Leverage existing libraries and the expansive Java ecosystem for functionality
        – Allow organizations to integrate with their existing infrastructure
      • Empower folks managing your infrastructure to make changes and reason about issues that are occurring
        – Data Provenance to show context and data’s journey
        – User Interface/Experience a key component
  42. NiFi Traffic Patterns Demo
  43. NiFi Traffic Patterns Lab
  44. Smart Cities: Traffic Congestion
      • Monitor:
        – Public transportation vehicles
        – Pedestrian levels
      • Optimize public transit duration and walking routes
  45. Our Lab for Today
      • We will work through some examples of creating a dataflow with Apache NiFi
      • Use Case: An urban planning board is evaluating the need for a new highway, dependent on current traffic patterns, particularly as other roadwork initiatives are under way. Integrating live data poses a problem because traffic analysis has traditionally been done using historical, aggregated traffic counts. To improve traffic analysis, the city planner wants to leverage real-time data to get a deeper understanding of traffic patterns. NiFi was selected for this real-time data integration.
      • Labs are available at http://tinyurl.com/nificrashcourse
  46. Getting Started Resources
  47. Connected Data Architecture with HDC for AWS (Cloud)
      Ideal Use Cases:
      • Data Science and Exploration (Spark, Zeppelin)
      • ETL and Data Preparation (Hive, Spark)
      • Analytics and Reporting (Hive2 w/LLAP, Zeppelin)
      Cloud Data Processing (HDC for AWS), Technical Preview: hortonworks.github.io/hdp-aws
  48. Learn more and join us!
      • Apache NiFi site: http://nifi.apache.org
      • Subproject MiNiFi site: http://nifi.apache.org/minifi/
      • Subscribe to and collaborate at dev@nifi.apache.org, users@nifi.apache.org
      • Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI
      • Follow us on Twitter: @apachenifi
  49. Big Data Tutorials
      • Get Started
        – hortonworks.com/tutorials
        – Apache Hadoop & Ecosystem: tinyurl.com/hello-hdp
        – Apache Spark: tinyurl.com/hwx-spark-intro
        – Apache NiFi: tinyurl.com/nifi-intro
        – Use Case: IoT, Social Media
  50. Hortonworks Nourishes the Community. Hortonworks Community Connection, Hortonworks Partnerworks
  51. Want to continue the technical introduction?
      • Hadoop Summit Crash Courses
        – Replays
        – Free
      • hadoopsummit.org/san-jose/agenda
        – Apache Hadoop
        – Apache Spark
        – Apache NiFi
        – IoT & Streaming
        – Data Science
  52. Questions? rafael@hortonworks.com, @racoss
  53. Thank you!
  54. Agenda: What is dataflow and what are the challenges?; Apache NiFi; Architecture; Demo; Community
  55. Brief history of the Apache NiFi Community
      Timeline:
      • 2006: Code developed at NSA (matured at NSA 2006-2014)
      • November 2014: Code available open source, ASL v2
      • July 2015: Achieved TLP status in just 7 months
      • Today
      Community:
      • Contributors from Government and several commercial industries
      • Releases on a 6-8 week schedule
  56. MiNiFi Prospective Plans - Centralized Command and Control. Centralized management console with a UI:
      • Design at a centralized place, deploy on the edge
        – Flow deployment
        – NAR deployment
        – Agent deployment
      • Version control of flows
      • Agent status monitoring
      • Bi-directional command and control
