Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

You Can't Search Without Data

242 views

Published on

As Apache Solr becomes more powerful and easier to use, the accessibility of high quality data becomes key to unlocking the full potential of Solr’s search and analytic capabilities. Traditional approaches to acquiring data frequently involve a combination of homegrown tools and scripts, often requiring significant development efforts and becoming hard to change, hard to monitor, and hard to maintain. This talk will discuss how Apache NiFi addresses the above challenges and can be used to build production-grade data pipelines for Solr. We will start by giving an introduction to the core features of NiFi, such as visual command & control, dynamic prioritization, back-pressure, and provenance. We will then look at NiFi’s processors for integrating with Solr, covering topics such as ingesting and extracting data, interacting with secure Solr instances, and performance tuning. We will conclude by building a live dataflow from scratch, demonstrating how to prepare data and ingest to Solr.

Published in: Software
  • Be the first to comment

You Can't Search Without Data

  1. 1. You Can’t Search Without Data Bryan Bende – Staff Software Engineer @Hortonworks NYC Solr/Lucene Meetup – December 7th 2017
  2. 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda à The Problem à Apache NiFi Overview à Integration between NiFi & Solr à Recent & Future Work à Demo Cool Stuff! à Q&A
  3. 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved About Me à Staff Software Engineer @ Hortonworks à Apache NiFi PMC & Committer à Contributed Solr processors in March 2015 – https://issues.apache.org/jira/browse/NIFI-461 à bbende@hortonworks.com / Twitter @bbende / bryanbende.com
  4. 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The Problem
  5. 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Team 2 It starts out so simple… Hey! We have some important data to send you! Cool! Your data is really important to us! Team 1 This should be easy right?...
  6. 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved But what about formats & protocols? Team 2 We can publish Avro records to a Kafka topic, does that work? Oh, well we have a REST service that accepts JSON… Team 1
  7. 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved And what about security & authentication? Team 2 Hmm what about security? We can authenticate via Kerberos Sorry, we only support 2-Way TLS with certificates Team 1
  8. 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved And what about all these devices at the edge? We also need to grab data from all these devices, how are we going to do that? Team 2
  9. 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Wouldn’t it be nice if there was a tool that could help these teams?
  10. 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Enter Apache NiFi…
  11. 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi • Created to address the challenges of global enterprise dataflow • Key features: – Visual Command and Control – Data Lineage (Provenance) – Data Prioritization – Data Buffering/Back-Pressure – Control Latency vs. Throughput – Secure Control Plane / Data Plane – Scale Out Clustering – Extensibility
  12. 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NiFi Core Concepts FBP Term NiFi Term Description Information Packet FlowFile Each object moving through the system. Black Box FlowFile Processor Performs the work, doing some combination of data routing, transformation, or mediation between systems. Bounded Buffer Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates. Scheduler Flow Controller Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use. Subnet Process Group A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
  13. 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Visual Command & Control • Drag & drop processors to build a flow • Start, stop, & configure components in real-time • View errors & corresponding messages • View statistics & health of the dataflow • Create shareable templates of common flows
  14. 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Provenance/Lineage • Tracks data at each point as it flows through the system • Records, indexes, and makes events available for display • Handles fan-in/fan-out, i.e. merging and splitting data • View attributes and content at given points in time
  15. 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Prioritization • Configure a prioritizer per connection • Determine what is important for your data – time based, arrival order, importance of a data set • Funnel many connections down to a single connection to prioritize across data sets
  16. 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Back-Pressure • Configure back-pressure per connection • Based on number of FlowFiles or total size of FlowFiles • Upstream processor no longer scheduled to run until below threshold
  17. 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Latency vs. Throughput • Choose between lower latency, or higher throughput on each processor • Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance • Processor developer determines whether to support this by using @SupportsBatching annotation
  18. 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Architecture - Standalone OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage à FlowFile Repository – Write Ahead Log – State of every FlowFile – Pointers to content repository (pass-by-reference) à Content Repository – FlowFile content – Copy-on-write à Provenance Repository – Write Ahead Log + Lucene Indexes – Store & search lineage events
  19. 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage Architecture - Cluster OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage ZooKeeper à Same dataflow on each node, data partitioned across cluster à Access the UI from any node à ZooKeeper for auto-election of Cluster Coordinator & Primary Node à Cluster Coordinator receives heartbeats from other nodes, manages joining/ disconnecting à Primary Node for scheduling processors on a single node
  20. 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NiFi & Solr
  21. 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NiFi Solr Processors à Support Solr Cloud and stand-alone Solr instances à Leverage SolrJ (CloudSolrClient & HttpSolrClient) à GetSolr – Extract new documents à PutSolrContentStream – Stream flow file content to an update handler
  22. 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved PutSolrContentStream à Choose Solr Type – Cloud or Standard à Specify ZooKeeper hosts, or the Solr URL with core à Specify the Solr path for the update handler à Dynamic Properties sent as key/value pairs on request à Relationships for success, failure, and connection failure
  23. 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved GetSolr à Incrementally extract new documents à Main query is *:*, Solr Query is optional filter query à Date Field used as filter query, from last execution or initial value à Sorted by date field and unique key à Cursor mark used behind the scenes à Specify return fields, or all if blank à Output Solr XML, or Records
  24. 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Interacting With a Secure Solr à Basic Auth – Provider username/password à Kerberos – Set JAAS system property in bootstrap.conf – Provide name of JAAS entry for processor to use à TLS/SSL – Provide an SSL Context Service – One-way TLS with Truststore only – Two-way TLS with Keystore + Truststore
  25. 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Recent & Future Work
  26. 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Problem – Conversion Between Data Formats à Specialized processors to operate on different data types à Sometimes missing conversions à Sometimes missing a specific function for a data type à Sometimes implemented with different libraries causing inconsistencies
  27. 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Solution – Record Processing à Introduce the concept of a ”record” – Released in Apache NiFi 1.2.0 (May 2017), improvements in 1.3.0 and 1.4.0 à Centralize the logic for reading/writing records into controller services – Readers/Writers for CSV, Json, Avro, etc. à Provide standard processors that operate on records – ConvertRecord, QueryRecord, PartitionRecord, UpdateRecord, etc. à Provide integration with schema registries – Local Schema Registry, Hortonworks Schema Registry, Confluent Schema Registry à Can still handle arbitrary data, but process records when appropriate
  28. 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Problem – Variable Handling à Need to parametrize values in the flow per environment – Connection strings, URLs, File System paths, etc. à Can set variables in bootstrap.conf – -Dmy.var=foo à Can set a properties file in nifi.properties – nifi.variable.registry.properties=production.properties à Both require command line access à Both require restart to pick up changes
  29. 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Solution – First Class Variable Registry à Variables associated with a process group, released in 1.4.0 à Right-click on canvas to view variables for current group à Hierarchical order of precedence, resolve closest reference to component à Editing variables automatically restarts any components referencing the variables
  30. 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Problem – How do I deploy my flow? à Most organizations want the classic development lifecycle (dev -> int -> prod) à Can copy flow.xml.gz between environments – Requires copying entire data flow – Can’t tell what changed, hard to diff if you put in version control – Requires all environments use the same encryption key for sensitive properties à Can make templates for portions of the flow – Script creation of template and deployment to next environment – Requires stopping flow and removing components, then re-instantiating template – No easy way to see changes, hard to rollback
  31. 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Solution – NiFi Registry à DISCLAIMER - UNDER DEVELOPMENT & NOT RELEASED YET! à Complimentary application, sub-project of Apache NiFi – https://github.com/apache/nifi-registry – https://issues.apache.org/jira/projects/NIFIREG à Central location for storage/management of shared resources across NiFi instances à Initial capability to store and retrieve “versioned flows” à A versioned flow is a snapshot of a process group at a given point in time à Potentially store extensions, shared data sets, and more in the future
  32. 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DEMO!!
  33. 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Example Scenario à User data – https://randomuser.me à Initially in CSV format – name.title,name.first,name.last,email,registered – mr,dennis,reyes,dennis.reyes@example.com,2012-04-10 01:54:19 – miss,carole,gomez,carole.gomez@example.com,2002-12-17 22:15:49 à Requirements – Convert CSV to JSON – Add a full_name field with first name + last name – Add a gender field based on title (i.e. if title == mr then MALE) – Ingest to different Solr collections depending on environment
  34. 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Questions?
  35. 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Learn more and join us! Apache NiFi site http://nifi.apache.org Subscribe to and collaborate at dev@nifi.apache.org users@nifi.apache.org Submit Ideas or Issues https://issues.apache.org/jira/browse/NIFI Follow us on Twitter @apachenifi
  36. 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank you!

×